MENTAL TESTS AS INSTRUMENTS OF SCIENCE


CHAPTER I


CONFUSION IN INTERPRETING MENTAL TESTS 


AS A MOVEMENT significant in human
affairs achieves prestige and influence,
it often becomes increasingly subject
to misinterpretation and misconstruction,
not merely by its opponents
but also by many of its nominal friends.
Usually its creators and leaders have too 
little time in which to protect the move- 
ment from these misunderstandings, and 
soon the movement is characterized by 
over-generalizations, spurious imitations, 
and especially confusion in basic termi- 
nology. Possible examples of this tendency 
range from Christianity through laissez- 
faire economics to Progressive Educa- 
tion. Such movements have needed, from 
time to time, re-examination and clarifi- 
cation in order to consolidate real 
achievements and to open the way for 
future progress. 

The mental testing movement is no ex- 
ception to the above remarks. Gaining 
momentum in the early years of this cen- 
tury under the leadership of Cattell, 
Thorndike, and Terman in the United 
States, it reached imposing eminence in 
the 1920’s. But during the 1930’s it has
shown increasingly the signs of misun- 
derstanding, misinterpretation, and mis- 
use which apparently characterize a ma- 
ture movement. A few of these signs are 
worth noting here. 

There is a growing tendency in our 
schools away from reliance on relatively 
narrow test measurement toward con- 
cern for the broader problem of compre- 
hensive evaluation. This is a matter of 
fact, but the several conflicting interpretations
of what this trend signifies reveal
an obvious state of misunderstanding 
and confusion. There are those who see 
this trend as an emancipation from the 
static and atomistic uniformities of test- 
ing and who would dispense almost en- 
tirely with test measurement in favor 
of evaluation (17, 299). Others consider 
this trend a threat to a science of educa- 
tion and would throw out qualitative 
evaluation in favor of quantitative 
measurement by tests (2, 119). Still 
others hold that test measurement is 
really one important kind of evaluation 
(39, 91; or 61, 433). The conflict is prob- 
ably more fundamental than a question 
of definition; it appears to involve di- 
vergent interpretations of the nature and 
function of mental tests as well as of 
measurement and evaluation. 

Another source of disagreement and 
confusion is the thoroughly unstandard- 
ized meaning of the term “measurement” 
in psychological testing. At least the fol- 
lowing conceptions of measurement are 
fairly common in the literature of test 
research: 


1. Measurement conceived as the designation
of quantities by either cardinal or
ordinal numbers.

2. Measurement conceived as the designa- 
tion of quantities by only cardinal numbers, 
which are submissible to arithmetical calcula- 
tion, thus excluding ranking as a form of 
measurement. 

3. Measurement conceived as simple count- 
ing, regardless of variations in quantity among 
the items being counted. 

4. Measurement conceived as the assignment
of ordinal numbers to objects to indicate
their rank order of worth to some end.
Each number represents the worth of the 
object as a whole, and is in no sense the sum 
of less worthy objects. 

5. Measurement conceived simply as the 
assignment of numbers to data. 

6. Measurement by tests conceived as in 
general the same as measurement in the physi- 
cal sciences. 

7. Measurement by tests conceived as con- 
stituting an entirely different order of precis- 
ion from the measurement of the physical 
scientist. 
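
The practical difference between the first two conceptions can be sketched in a few lines of Python. The pupil names and scores below are hypothetical, invented only for illustration, and the helper `ordinal_ranks` is not drawn from any cited study. The sketch assumes no tied scores.

```python
# Hypothetical raw scores for four pupils (illustrative values only).
scores = {"Ann": 95, "Ben": 60, "Cal": 58, "Dee": 20}

def ordinal_ranks(values):
    """Assign rank 1 to the highest score, 2 to the next, and so on (no ties assumed)."""
    ordered = sorted(values, key=values.get, reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}

ranks = ordinal_ranks(scores)

# Cardinal treatment: differences between scores are themselves quantities.
score_gap_ann_ben = scores["Ann"] - scores["Ben"]   # 35
score_gap_ben_cal = scores["Ben"] - scores["Cal"]   # 2

# Ordinal treatment: adjacent ranks always differ by exactly 1,
# whatever the size of the underlying gap.
rank_gap_ann_ben = ranks["Ben"] - ranks["Ann"]      # 1
rank_gap_ben_cal = ranks["Cal"] - ranks["Ben"]      # 1
```

Under the cardinal reading the two gaps differ widely (35 against 2); under the ordinal reading they are indistinguishable. Arithmetic performed on ranks therefore says nothing about amounts, which is why conception 2 excludes ranking from measurement.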


It may be assumed that specialists in 
the field of psychological testing are, in 
most cases, cautious to see what each 
research study means by the term “mea- 
surement” before attempting to inter- 
pret, manipulate, or apply the recorded 
results. However, the usual teacher or 
layman, impressed as he is by the 
astounding achievements of measure- 
ment in the sciences, is given little pro- 
tection from accepting or making false 
interpretations of the results of “scien- 
tific measurement” by many of the tests 
now current. 

A third source of confusion and con- 
flict relates to the appropriate function 
of psychological tests. Are they tools of 
a basic science of psychology? Or are they 
rather practical devices for evaluating in 
particular situations? Or are they both? 
While there is an abundance of evidence 
that psychological tests are employed for 
both purposes—often a test-builder will 
use the same test in both contexts—the 
opinions of psychologists continue to dif- 
fer widely on this practice. Some look 
upon psychological tests as very useful 
diagnostic instruments for immediate,
practical purposes, like grouping pupils 
or selecting the best applicant for a job, 
but as unsuited to the discovery of those 
facts, principles, and laws characteristic
of a basic science of psychology. Others 
not only hold that psychological tests are 
one of the most important methods of
psychological research, but contend that 
psychology as a science consists pretty 
largely of the knowledge obtained by 
these tests.¹ Most representatives of this
latter viewpoint agree that tests are also 
applicable to particular problems of eval- 
uation, a position which is in line with 
the pragmatic theory that the scientific 
difference between research on the atom 
and the construction of a bridge is only 
one of degree. However, to recognize 
that the difference is one of degree is 
not to say that the two enterprises pro- 
duce results of identical significance. It 
is unfortunately true that a number of 
test experimenters, while admitting the 
difference in degree between basic psy- 
chological research and practical school- 
room evaluation, have tended to ignore 
the difference by attributing to evalua- 
tional results the same prestige and sig- 
nificance we have learned to give to basic 
scientific facts. Apparently when the 
difference between two things is only one 
of degree instead of kind, the difference 
is easy to overlook. In the field of psy- 
chological testing this has, on the one 
hand, invited unwarranted generaliza- 
tions from certain test results and, on the 
other hand, obscured the need for a 
critical examination of the particular 
purposes for which these tests were built 
before using them or their norms in 
other contexts. 

One result of these conflicting view- 
points and interpretations has been the 
reappearance of considerable suspicious 
criticism of psychological testing and, to 
some extent, a reaction against it. While 
much of this criticism is probably un- 
fair, misconceived, or based on sentimen- 
tal objections to the implications of the 
movement, some is undoubtedly accurate
and justifiable, even though it has been
piecemeal instead of thorough-going.
What appears to be needed is a searching
examination and clarification of the
foundations of the psychological testing
movement in order to give a basis for
substantiating its achievements and eliminating
some of the false claims made by
both its supporters and its critics.

¹ For a symposium of views, see Terman (51).

To meet completely the need, such an 
examination should cover the whole 
range of test uses from basic scientific 
research to such immediate problems of 
evaluation as rating teachers by the per- 
formance level of their pupils. Because 
of the long-standing association of mental
tests with the term measurement and
the science of psychology, it is obvious 
that the examination should begin with 
the place of tests in scientific research and 
then proceed to their function in matters 
of evaluation. If the first step had ever 
been adequately done, the writer would 
really prefer to start at once with the 
question of evaluation, for this topic is 
just entering a period of major signifi- 
cance in educational practice. 

The place of tests in scientific research 
has received some passing notice. Dewey’s 
recent book, Logic (11), provides an ex- 
cellent comprehensive treatment of the 
theory underlying the methodology of 
science and evaluation. The American 
Council on Education has published a 
booklet on “Educational Research” (1), 
in which a chapter on test measurement, 
based in part on the original material 
of this monograph, outlines the problem. 
B. O. Smith has recently published a 
study (44) of achievement testing in 
terms of the logical assumptions of 
measurement, thus contributing consid- 
erable ground-clearing to one aspect of 
the problem. A thorough-going analysis 
of mental testing from the point of view 
of a basic science of psychology, however,
remains to be undertaken. This problem
will therefore be set as the theme of this 
monograph, and a consideration of tests 
as instruments of evaluation will be post- 
poned for later treatment elsewhere. 

That tests represent some sort of scientific
measurement has been claimed with
varying reservations ever since McCall, 
interpreting Thorndike, said: “Measurement
in education is in general the same
as measurement in the physical sciences”
(28, 5). While it is currently granted that
much testing which goes under the name 
of measurement is not truly scientific 
measurement as it is operationally de- 
fined in the natural sciences, some types 
of psychological tests, especially those 
which test intelligence, are commonly 
accepted by reputable experimenters as 
genuine measuring devices. Indeed, some 
psychological tests must be regarded as 
giving an accurate index to fundamental 
human abilities if these tests are to make 
reliable contributions to a science of 
psychology. 

That tests are intended to serve the 
purposes of a science of psychology is a 
rather frequent contention of certain 
experimenters. As early as 1924, when 
twenty-two outstanding psychologists in 
the United States were polled in regard 
to what contributions psychological tests 
had been or were making to the science 
of psychology, a number of them listed 
such topics as mental organization, men- 
tal types, the nature of intelligence, and 
the effect of heredity and environment 
(50, 116-17). At that time Cattell, one of 
the early leaders of the test movement, 
claimed: “Psychology as a science consists 
largely of the knowledge obtained by 
quantitative psychological tests. This 
knowledge is in itself of substantial im- 
portance and is interrelated with the 
whole field of psychology. It is as ultimate,
both as description and in its application,
as any other part of psychology”
(50, 117). Terman added: “It [the mental 
test] has become one of the important 
methods of psychological research; some 
would say, the most important. Not the 
least of its contributions is the fact that 
it has broadened and intensified our in- 
centives to research...” (50, 117). In 1927 
Thorndike published his monumental 
work, The Measurement of Intelligence, 
complete with a scientific theory of in- 
telligence and the procedures of measure- 
ment. In more recent years, Kelley (21) 
and Thurstone (60) have made active
proposals for the use of tests in the scien- 
tific description of elementary human 
abilities. At the present time factor an- 
alysis, a statistical treatment of the cor- 
relation between test scores in search of 
the unitary psychological abilities in- 
volved, is being vigorously pursued by 
many psychologists both in the United 
States and in Great Britain. Moreover, 
the papers published in the thirty-ninth 
Yearbook of the National Society for the 
Study of Education (35) lead one to believe
that the relative weights to be assigned
nature and nurture in determining
intelligence depend heavily on the
constancy or inconstancy of the IQ as 
derived from current tests. 

The above citations are sufficient evi- 
dence that many experimenters are seri- 
ously engaged in using psychological 
tests to discover something about the 
universal nature of human capacities and 
powers. If certain test results are to be 
interpreted as revealing this kind of 
knowledge, then experimenters with
tests may be expected to observe the 
procedures, controls, and verifying methods
commonly accepted in modern
science.² By a careful examination of the
logic of scientific methodology, with
special reference to the field of psychol- 
ogy, we may establish a basis for clarify- 
ing the current issues and disputes on the 
scientific status of testing and test results. 
With the completion of this task, we 
should be in a position to indicate in 
outline the logical conditions and prospects
of psychological testing in a science
of psychology. A systematic analysis and 
criticism of the extent to which recent 
psychological testing is a form of scien- 
tific measurement or contributes to a 
basic psychological science will therefore 
be the principal aim of this study. 

This aim, it should be noted, automat- 
ically eliminates from consideration in 
this volume such instruments as achieve- 
ment tests, quality rating scales, and in- 
terest inventories, for these are clearly 
intended to be evaluation aids for special 
purposes. The kind of psychological tests 
under consideration here will be those 
intended to reveal the nature and 
amount of fundamental human apti- 
tudes and abilities, such as intelligence. 

In order to focus the succeeding dis- 
cussion, the field of psychological testing 
in aptitudes and abilities will be subject 
to the following delimitation: the type 
of tests to be considered is the objective 
paper-and-pencil form which contains 
within itself the stimuli to the behavior 
desired and which receives the record of 
the response. This is not only the most 
widely used form of psychological test, 
but also, for the purposes of this study, 
adequately illustrates the essential char- 
acteristics of test construction, test uses, 
and interpretations of test results in this 
area. 


² “Psychological theory can be rigorous,” says
L. L. Thurstone (57). 





CHAPTER II 


ESSENTIAL CHARACTERISTICS OF A SCIENCE 


BEFORE the essential characteristics of
a science are developed and defined,
an important distinction between science 
and scientific method should be recog- 
nized. A science, as the term is to be 
used here, employs the scientific method 
exclusively. The method, however, is not 
restricted to science. It is logically appro- 
priate to any area of human inquiry— 
civil engineering, educational evalua- 
tion, socio-economic problems, moral 
questions, and the philosophical inquiry 
of experimentalism. Being unlimited in 
its applicability, scientific method is to 
be contrasted only with rival methods, 
such as arbitrary authoritarianism, mys- 
ticism, magic, or philosophies claiming 
sources of absolute knowledge. 

The main principles of the scientific 
method of inquiry are so well known as 
to require only a summary here. John 
Dewey has recently made them the sub- 
ject of an exhaustive analysis and exposi- 
tion (11). The major stages in scientific 
inquiry may be outlined as follows: (1) 
Inquiry is anteceded by an indetermi- 
nate situation, one which arouses doubt, 
confusion, or conflict for a person’s sub- 
sequent behavior. (2) Inquiry begins with 
attempts to locate the proximate nature 
of the problem in this situation, and to 
define it in terms of some end-in-view. 
(3) After a preliminary but careful exam- 
ination of the familiar elements in the 
problematical situation, possible solu- 
tions are devised and selected. (4) These 
possible solutions are subjected to con- 
trolled testing (including both anticipa- 
tory reasoning and actual performance). 
(5) From this process is determined a 
solution-pattern which makes the total
situation now determinate, understood, 
available for subsequent use and enjoy- 
ment. 

The wide applicability of the scientific 
method of inquiry establishes, of course, 
a continuity between science-as-such and 
any other area of human activity where 
inquiry is undertaken, and the need for 
greater recognition of the common ap- 
plicability of this method (especially in 
social inquiry) can scarcely be over-em- 
phasized. However, the use of a common 
method in various types of endeavor does 
not mean that the various results 
achieved are equally exact or generalized 
or “scientific.” Obviously the nature and 
content of the results achieved will be 
extensively affected by the ends-in-view, 
the subject matter dealt with, the tech- 
niques available, and so on. As a con- 
sequence, it is to be expected that the 
enterprise of science will have charac- 
teristics distinguishing it from such other 
enterprises of inquiry as, say, engineer- 
ing or educational evaluation. 

The principal characteristic distin- 
guishing the work of the sciences is the 
way in which the respective subject mat- 
ters are treated or, to put it more ex- 
plicitly, the way in which the problems 
studied by the sciences are defined. For 
instance, the study of particular events 
as such is not a problem of science, but 
the study of the constant relations be- 
tween particular events is (7, 37). More- 
over, the kind of relations studied by 
science are those which may be so gener- 
alized as to be independent from any one 
set of events and apply to all cases of this 
kind of events. “The generality of all
scientific subject-matter as such means 
that it is freed from restriction to condi- 
tions which present themselves at partic- 
ular times and places. Their reference 
is to any set of time and place conditions” 
(11, 117). 

The unique character of the prob- 
lems studied in a science is accordingly 
reflected in the kind of ends or products 
sought. Thus, the characteristic end- 
products of research in science are generalized
facts, confirmed laws, verified
theories. To achieve such products, the
pervading purpose of scientific research 
becomes thorough-going objectivity (the 
purpose described by classicists as scientific
Truth)—i.e., what actually happens
under controlled conditions as far as
possible independent of any immediately 
practical human purpose. “The formu- 
lations of science, internally isolated, are 
statements invariant with respect to the 
individual” (25, 189). In short, a scien- 
tific problem is defined so as to gain uni- 
versal agreement on the purpose of the 
research and on what kind of results are
acceptable. This high degree of agree- 
ment is obtained primarily through the 
objectivity of the physical operations 
which are employed in the solution. Ex- 
actness to a degree appropriate for the 
use of cardinal numbers is sought, and, 
as most scientists agree, exactness re- 
quires a high degree of isolation of the 
system under investigation. All these 
characteristics of research in science be- 
come involved by the nature of the kinds 
of problems formulated by the sciences.

The distinctions here offered between 
typically scientific problems and other 
problems of more particular and imme- 
diately practical import conform with 
customary usage, but the point should 
again be stressed that the scientific 
method of inquiry is in no sense limited
to problems of science but rather is appropriate
and important to all kinds of
problems. These distinguishing charac- 
teristics of the problems of science are 
offered primarily to delimit the central 
problem of this monograph—the place 
of mental testing in a science of psychol- 
ogy. Psychological problems in the broad- 
est sense range from questions of interest- 
ing Mary in arithmetic to isolating the 
basic mental factors involved in arithmet- 
ical computation. All of these should, 
of course, be attacked in the method- 
ology of scientific inquiry. But in this 
treatise, mental tests will be considered 
only in their adequacy to deal with prob- 
lems characteristic of a basic science of 
psychology—e.g., the laws of learning, a 
theory of intelligence, or the elementary 
causes of specified types of performance. 

Science has so far been distinguished 
from other types of inquiry by the kind 
of problems studied and the kind of out- 
comes sought. In most psychological re- 
search which purports to be scientific, 
there is very little disagreement as to 
what kind of problems are scientific 
problems. But certain misconceptions 
have arisen of what science is as a body 
of knowledge and of how that knowledge 
is validated. These misconceptions have 
led to considerable confusion and am- 
biguity in the results of psychological 
research, particularly in testing. Conse- 
quently, the remainder of this chapter 
will examine in some detail the essential 
characteristics of scientific knowledge. 
These will be grouped under three head- 
ings: (1) the nature of scientific proposi- 
tions, (2) the nature of scientific laws, 
and (3) the nature of scientific theories. 
This examination will provide many of 
the necessary criteria for appraising the 
uses of mental tests in scientific research. 

SCIENCE AND MATHEMATICS

Most scientists are very careful to dis- 
tinguish between science on the one 
hand and mathematics or logic on the 
other (mathematics being a highly sym- 
bolic form of logic) (40, 13; 7, 32). The 
basis of this distinction is to be found in 
the two kinds of propositions employed 
in scientific inquiry. One kind is de- 
signed to describe existential facts in the 
actual world. Such propositions are 
sometimes called categorical, material, or 
existential propositions. They are de- 
fined by objective, physical operations. 
The achievement and verification of 
these propositions is the end-in-view of 
all scientific inquiry, for they express a 
determinate, understood situation. Any 
statement of fact, from Boyle’s law to 
“This book is green,” is an existential 
or categorical proposition. 

The other kind of propositions is de- 
signed to describe the “world” of possi- 
bility. Such propositions are called logi- 
cal, formal, or hypothetical. They are 
defined by logical or mathematical operations.
A geometric theorem or an algebraic
equation is a hypothetical proposition.
These do not need or purport to
have reference to actual existence, al- 
though they are very useful in inquiry 
into existence. They are free from any 
necessity of existential reference and 
from any particular existential subject 
matter, although any subject matter 
might, under appropriate conditions, be- 
come the content of these propositions. 

The distinction between these two
kinds of propositions has led philosophers
and scientists in times past to set
up a sharp dualism-in-kind between so- 
called purely mental concepts and the 
concepts intended to describe an existen- 
tial fact. This dualism, as Dewey (11, 
esp. Pt. IV) has shown, is totally unwarranted.
On the one hand, the most abstract
mental concept has its original
source in the stream of actual concrete 
experience, even though many experi- 
ences may have been decimated and re- 
combined in novel patterns to create that 
concept. Logical concepts may have no 
single physical referent in experience, 
but they are nothing more than the 
products of refinements, reconstructions, 
and elaborations of the empirical data 
of many experiences. To separate them 
completely from the stream of experience 
would be to make them incomprehensi- 
ble. 

On the other hand, the most concrete 
physical description of an empirical 
datum has something of the abstract and 
general in it. An observation of similarity 
to another datum is to some degree a 
generalization, and the use of such terms 
as “blue” or “heavy,” which are abstrac- 
tions from many experiences, gives a 
conceptual or “mental” tinge to the con- 
crete description. 

However, the denial of a sharp dichot- 
omy between existential propositions 
and logical or mathematical propositions 
does not mean at all that the two are 
practically the same. Wide differences of 
degree can, and frequently do, separate 
the two. A logical concept or symbol is 
often so dimly related to the many as- 
pects of physical experience out of which 
it grew that it has no direct physical 
meaning—indeed, no physical meaning 
might be conceivable for it. Accordingly, 
the meaning of logical concepts or sym- 
bols is based on the logical or mathema- 
tical operations one must perform in 
arriving at them—for all practical pur- 
poses, their circuitous relationship to 
concrete experience may be ignored. 
Toward the other end of the scale, a 
physical concept, which is intended to 
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describe an empirical fact in experience, 
gets its meaning from the physical opera- 
tions one must perform in defining it. 
In this case, the conceptual abstractions 
that assist the physical description, 
though known to be present, can usually 
be ignored. 

Now obviously much data not only 
will satisfy the rules of logical and 
mathematical procedure but will also 
be the existential referents of physical 
operations. For a simple example, the 
statement 2 plus 2 equals 4 is only a 
logical proposition. The statement 2 
apples plus 2 apples equals 4 apples is 
a proposition which is both logically 
accurate and existentially verifiable. The 
proposition 2 minus 3 equals minus 1 is 
also logically accurate, but 2 apples 
minus 3 apples equals minus one apple 
is not an existentially verifiable proposi- 
tion. 
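
The apple illustration can be put in a short Python sketch. The function `combine_counts` is a hypothetical helper written for this illustration. It assumes only that a count of physical objects is verifiable when it is zero or greater, and it returns `None` where no such count exists.

```python
def combine_counts(have, change):
    """Return the resulting count of physical objects, or None when no such count exists.

    Integer arithmetic always yields an answer; a count of objects
    is verifiable only when it is zero or greater.
    """
    result = have + change
    return result if result >= 0 else None

formal_sum = 2 + 2                              # 4, consistent within arithmetic
formal_difference = 2 - 3                       # -1, equally consistent within arithmetic
apples_after_adding = combine_counts(2, 2)      # 4 apples can be counted
apples_after_removing = combine_counts(2, -3)   # None: no basket holds -1 apples
```

Both formal results follow from the rules of arithmetic alone. Only the first has a counterpart that could be checked by counting objects, so logical consistency and existential verifiability are settled by different operations.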

The significance or validity of logical 
concepts is far from being dependent on 
the correspondence of their content with 
existential facts. Their chief function is 
the instrumental one of ordering empiri- 
cal data through exceedingly complex 
transformations toward a set of possible 
solutions, one of which can be verified as 
the effective solution of the original 
problem. 

Up to this point the intent has been 
to differentiate but not separate existential
propositions and hypothetical propositions.
It is now appropriate to record
in orderly fashion the essential charac- 
teristics which each kind of proposition 
is assigned, by general agreement, for 
unequivocal use in the process of scientific
inquiry.

1. Four conventions are commonly ap- 
plied to existential propositions. First, 
they must be empirically known. The 
care with which this criterion is observed, 
even with such a familiar concept of
elementary physics as the electric field, is
illustrated in the following quotation: 


There can be no question whatever of the 
tremendous importance of the concept of the 
electric field as a tool in thinking about, de- 
scribing, correlating, and predicting the prop- 
erties of electrical systems; electrical science is 
inconceivable without this or something 
equivalent. . . . [But] it seems to me that any 
pragmatic justification in postulating reality 
for the electric field has now been exhausted, 
and that we have reached a stage where we 
should attempt to get closer to the actual facts 
by ridding the field concept of the implica- 
tions of reality (3, 58-59). 


Second, existential propositions must 
constitute objective knowledge—i.e., 
knowledge which has relations common 
to all men. This is equivalent to saying 
that these propositions must be inter- 
subjectively known, thus avoiding any 
fundamental dualism between objective
and subjective. 

Third, each existential proposition 
should be defined by a unique set of 
physical operations and consequently 
have a physical reference. This is essen- 
tially a definition of how the proposition 
is empirically known and intersubjectively
known. It is not enough that the
data in question be subject to direct ex- 
perience. People still find it too easy to 
make gratuitous assumptions about the 
nature of what they are directly experi- 
encing. A thing or process is empirically 
known (not merely experienced) in terms 
of the unique set of operations which in 
actual practice distinguishes it from any- 
thing else. The set of operations gives 
the objectivity or highly constant inter- 
subjectivity demanded in science. In or- 
der further to insure and preserve this 
important objectivity, particularly when 
any object or event is knowable by sev- 
eral sets of operations, each operational 
concept of the object or event should be 
kept clearly distinct (in written or spoken 
reference) from the others. “If we have
more than one set of operations, we 
have more than one concept, and strictly 
there should be a separate name to cor- 
respond to each different set of opera- 
tions” (3, 10). 

The above point may be illustrated by 
the concept of velocity in the physical 
sciences. The velocity of an object on the 
face of the earth is defined by the opera- 
tions of timing over a measured distance. 
The velocity of a star is defined by operations
involving the displacement of light
rays passing through a spectrum. The 
two concepts of velocity are scientifically 
distinct and operationally incomparable, 
except for possible cases within a narrow 
range where both sets of operations can 
be applied to the same phenomena. By 
strictly adhering to this difference be- 
tween concepts when different operations 
are required, scientists are able to feel 
assured that no future discoveries will 
invalidate current scientific knowledge. 
Illustrating this same principle in the 
measurement of intelligence, we find one 
definition expressed in the operations of 
an objective test and another in the op- 
erations of, say, a teacher’s judgment. 
The two concepts of intelligence are 
operationally distinct and as yet com- 
parable only hypothetically through the 
correlation of rank orders. 
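
A comparison by rank order of this kind can be sketched in Python. The text does not name a particular coefficient, so Spearman's rank-difference formula is assumed here, and the function name and both sets of ranks are hypothetical. The formula as written holds only when neither ranking contains ties.

```python
def spearman_rho(ranks_a, ranks_b):
    """Spearman rank correlation for two untied rankings of the same individuals.

    Uses rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), where d is the difference
    between the two ranks assigned to each individual.
    """
    if len(ranks_a) != len(ranks_b):
        raise ValueError("both rankings must cover the same individuals")
    n = len(ranks_a)
    if n < 2:
        raise ValueError("at least two individuals are required")
    sum_squared_differences = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - (6 * sum_squared_differences) / (n * (n ** 2 - 1))

# Hypothetical ranks of five pupils under two distinct sets of operations.
test_ranks = [1, 2, 3, 4, 5]       # rank order from an objective test
teacher_ranks = [2, 1, 3, 5, 4]    # rank order from a teacher's judgment

rho = spearman_rho(test_ranks, teacher_ranks)   # approximately 0.8 for these ranks
```

A high coefficient shows only that the two procedures order the pupils similarly. It does not make the two operational concepts of intelligence one concept, which is the point of keeping them distinct.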

A fourth convention of science to in- 
sure universal agreement on categoricals 
is the verification of hypotheses or pro- 
posed solutions under conditions where 
they might also be falsifiable. The test 
of the truth of a solution to a scientific 
problem is “what works,” but especially 
critical attention is given to “what 
works.” Knowledge of precisely what 
worked depends upon careful control
of antecedent conditions so that the experimental
result achieved may be accurately
related to, and only to, the
actually effectual factors. All this is simply
the application of the logical conditions
of affirmation (inclusion) and negation
(exclusion) (11, ch. 10). It is a
reminder to the experimenter that a 
negative result may be due to unsus- 
pected conditions which he failed to con- 
trol rather than the falsity of his hy- 
pothesis, or that a positive result may be 
due to uncontrolled conditions other 
than those included in his hypothesis. 

2. Some of the essential characteristics 
of hypothetical propositions were men- 
tioned in the earlier identification of 
these propositions. Since they are con- 
cerned with formal relationships regard- 
less of particular content, they are some- 
what like a string of empty railroad cars 
which can be shifted from track to track 
and train to train according to certain 
rules of the switchman. Almost anything 
can be loaded into them, providing the 
load will fit the contours of the cars, and 
by following the rules of the switchman, 
just about any kind of train can be made 
up. The concern of the shipper is to se- 
lect from all possible trains the kind of 
train which best promises to carry his 
goods to the desired destination. 

This crude analogy suggests several 
important features of logical or mathe- 
matical propositions and concepts. First, 
the switchman’s rules are the axioms of 
mathematics, which need be neither true 
nor false (existentially speaking) in them-
selves but are rather a set of postulates
inter-consistent with each other. Since 
any consistent set of postulates is now 
permitted in mathematics, the choice of 
any one set by the scientist is conditioned 
only by the nature of the problem in 
hand and the need for a fruitful array 
of implied consequences from these pos- 
tulates. 

Second, since the postulates of logic 
or mathematics need be neither true nor
false, the possible implications or hypo-
theticals which may be deduced from
these postulates are also neither true nor
false in their formal relationships. The
appropriate phrase is that they are
thoroughly consistent with each other.
Thus, we are reserving the terms “true”
and “false” for existential propositions.
The justification for doing this will be
offered through illustration. The classic 
statement of the implicatory relationship 
in logical propositions is, “If p, then q.” 
Now the letter p may stand in principle 
for absolutely anything: an existential 
concept, an imaginary concept, a false 
proposition, a winged horse, or a horned 
building. And similarly the letter q may 
stand for absolutely anything. The con- 
tent of p and q in no way affects the con- 
sistency or inconsistency of the p-and-q 
relationship with the postulates of logic. 
And conversely, this logical proposition
in no way asserts the truth or existential 
reality of the contents of p and q. We 
have already noted that a mathematical 
proposition, though invaluable as a tool 
of science, is not a scientific statement of 
fact per se, nor can any amount of mathe- 
matical manipulation alone make it so. 
Scientific propositions must be proved
experimentally, empirically, to represent
an actual fact in the existential world,
but mathematical propositions are never
proved in this way. They must be proved
consistent by logical deduction from pos-
tulated rules. Consequently, there is
hardly occasion to speak of truth or 
falsity in connection with mathematical 
propositions as such, since the term “con-
sistency” covers the question. This
means, of course, that truth and falsity,
as limited to existential propositions,
cannot be absolutes, for no existential
proposition is utterly and unchangeably
exact. Dewey’s phrase “warranted assert-
ibility” (11, 9) in place of truth for ex-
istential propositions is perhaps a safer
term to use to avoid misinterpretation or
misunderstanding.
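
The independence of the form “If p, then q” from the content of p and q can be illustrated by a small check. The sketch assumes the truth-functional (material) reading of implication used in modern symbolic logic, which is an assumption of this example and not necessarily the author's or Dewey's account; the function names are hypothetical. It tests two argument forms under every assignment of content to p and q.

```python
from itertools import product


def implies(p, q):
    """Material implication: false only when p holds and q does not."""
    return (not p) or q


def holds_for_all_contents(form):
    """True if the form comes out true whatever p and q stand for."""
    return all(form(p, q) for p, q in product([True, False], repeat=2))


def modus_ponens(p, q):
    # From "if p, then q" and "p", infer "q".
    return implies(implies(p, q) and p, q)


def affirming_the_consequent(p, q):
    # From "if p, then q" and "q", infer "p" (an inconsistent form).
    return implies(implies(p, q) and q, p)


print(holds_for_all_contents(modus_ponens))
print(holds_for_all_contents(affirming_the_consequent))
```

The check concerns consistency with the rules only. It asserts nothing about whether any particular p or q, winged horse or otherwise, exists.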

From this discussion there follows a 
third and final significant characteristic 
of mathematical propositions and con- 
cepts. Although they are non-descriptive 
in themselves of existential facts, they 
serve an exceedingly important instru- 
mental function in enabling scientific in- 
quiry to arrive at existential statements 
of fact. In itself, mathematics is the disci- 
pline of transformability; in scientific 
inquiry, it is the technique of trans- 
formation. It occupies a mediating po- 
sition between the facts defining the 
problem and the possible solutions, 
which, when tested, may resolve the prob- 
lem. To use an analogy, mathematics 
builds bridges over areas that empirical 
science by itself could never cross, but, 
of course, empirical verification is then 
required to establish that the other end 
of the bridge touches real, existential 
facts. 

While it is important to recognize the 
necessity of distinguishing between hypo- 
thetical and existential propositions in 
their scientific meanings and applica- 
tions, it is equally necessary to recall that 
a hypothetical proposition may also be a 
true existential proposition within an ac-
ceptable margin of error. Most of the
existential descriptions of physics have 
sufficient quantitative exactness for 
mathematical expression. This is un- 
doubtedly due in large part to the fact 
that problems are created in physics by 
abstracting out the geometrical relation- 
ships of events, and these relationships 
are capable of strict logical treatment. 
This fact also establishes the significance 
of measurement to a science, since nu- 
merical relationships permit the trans- 
forming use of mathematics. While nu- 
merical and geometric relationships are
rather more scarce in psychology, per-
haps because of the difference in subject
matter, it is reasonable to presume that 
the development of a science of psy- 
chology will be closely related to the 
increased establishment of such relation- 
ships. 


SCIENTIFIC LAWS 


The establishment of laws was noted 
earlier as a purpose which distinguished 
science from other types of human enter- 
prise. The use of scientific laws, of 
course, characterizes many enterprises, 
but the creation of these laws is peculiar 
to science. 

The propositions of especial concern 
to science are those which describe the de- 
terminate and constant relationships be- 
tween various kinds of existential objects 
or events. Speaking inclusively, these veri- 
fied propositions are the laws of a science. 
The kinds of relationships expressed in 
laws are often described as either associ- 
ational or sequential. It is important to 
notice that the subject matter of a law is 
not the events themselves, but the inter- 
action between the events. Thus, a sci- 
entific law has the effect of resolving two 
or more qualitatively different objects or 
events into a single continuous event, ex- 
pressed in a proposition of relationship 
(11, ch. 22).

Scientific laws are sometimes differ- 
entiated according to whether they hold 
individually for each case or only statis- 
tically for, say, a certain percentage of 
the cases. The latter type of law can rep- 
resent, of course, only probability for 
the individual case. Not so long ago the 
only acceptable laws were those which 
were exceptionless in the sense of holding 
for each individual datum in the group, 
but in recent times there has been in- 
creasing recognition that all laws are 
statements of probability. Even the law
without historical exception for the indi-
vidual case can offer nothing more cer-
tain than extremely high probability for 
the behavior of all future cases. But even 
though laws deal with probabilities, they 
still are required to express constant re- 
lationships—i.e., constant statistics on the 
relative frequency of the various possible 
events to be expected from certain spe- 
cific conditions. 
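
What it means for a statistical law to express “constant statistics on the relative frequency” may be sketched by simulation. Everything here is assumed for illustration: the probability of 0.30, the function name, and the use of a seeded pseudo-random generator in place of real observations. The sketch shows that no single case is determined, while the relative frequency over many cases settles near a constant value.

```python
import random


def relative_frequency(trials, probability, seed):
    """Fraction of simulated cases in which the event occurs."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    hits = sum(rng.random() < probability for _ in range(trials))
    return hits / trials


# Hypothetical law: under the specified conditions the event
# occurs in 30 per cent of the cases.
for n in (10, 1_000, 100_000):
    print(n, relative_frequency(n, 0.30, seed=1))
```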

A more important differentiation of 
scientific laws is based on the fact that 
some have direct existential reference 
and that others are hypothetical con- 
structs with no direct existential refer- 
ence. Laws of the first type are equivalent 
to existential generalizations on the re- 
lationships of particular kinds of events, 
such as the fact that water boils at 212 
degrees F. at sea level. As was intimated 
above, such laws still have some hypo- 
thetical character about them, for al- 
though the law is true of all cases ex- 
amined so far, the future may reveal an 
exception. This observation, however, 
concerns the approximate nature of all 
scientific propositions and in no way de- 
tracts from the law’s standing as an ex- 
istential description of natural phenom- 
ena. 

Laws of the second type, expressed in 
hypothetical constructs, are inventions 
of the scientist to understand and con- 
trol a wide area of data where no existen- 
tial law of comparable effectiveness ap- 
pears possible. The chief advantages 
claimed for hypothetical constructs are 
that they (1) make laws more general 
(i.e., more widely explanatory), (2) make 
laws more exact, since all these constructs 
are developed in rigorously logical or 
mathematical terms, and (3) avoid irrele- 
vant existential data. The law of gravity 
is an example of a hypothetical construct, 
expressing in its equation such non- 
existential abstracts as mass and mutual
attraction. But regardless of the hypo-
thetical nature of the law, it is framed
with reference to the determinate solu-
tion and control of existential problems,
and consequently it is subject to valida-
tion in an existential context. If its fruits
fail to continue solving the scientific
problems appropriate to it, the law will
have to be reformulated or abandoned.
A case in point is the Einsteinian re-
formulation of the old Newtonian law of
gravity.

The obvious purpose of scientific laws 
is to solve scientific problems, make these 
problems determinate, or, as we say, ex-
plain them. In essence, explanation con-
sists in converting the unfamiliar into 
the familiar, usually by relating prob- 
lematical events to such an ordered sys- 
tem of occurrence that our scientific curi- 
osity rests. This, as we have seen, may 
be accomplished by generalizing, under 
specific existential conditions, certain 
constant relationships between events, or 
by creating a hypothetical construct by 
means of which certain existential phe- 
nomena may be ordered and controlled. 
In either case, explanation gives the
power of prediction. Indeed, it is a scien-
tific commonplace that the power to pre-
dict is a good measure of the adequacy
of the explanation.

A special kind of explanation is caus-
ality. It is not to be confused with a
constant relationship of so-called se-
quence between phenomena, though this
relationship is often popularly referred
to as “cause and effect.” Causality de-
pends primarily on the possibility of in-
stituting controlled variations in the
means used to convert a problematic
situation into a determinate one, i.e., to
achieve the specific end-in-view. If dif-
ferent means produce different conse-
quences, those means are obviously dif-
ferent causes. But if two different means
result in apparently the same conse-
quence, a further problematic situation
is suggested, because the likelihood is
that one or both sets of means are result-
ing in wider and different consequences
that are escaping observation.

A concluding word should be ap- 
pended concerning numerical laws—i.e., 
laws based on fundamental measure- 
ment. These laws may be either existen- 
tial generalizations or hypothetical con- 
structs. The following quotation testifies 
to their great significance in science: 

“These laws are recognized to have the high- 
est degree of probability of any known to us 
and are capable of being based upon a very 
small number of observations, provided they 
are properly carried out. . . . By means of 
measurement and numerical laws we can de- 
velop the ideas of functional relations which 
are eminently suited to mathematical treat- 
ment” (40, 107). 


SCIENTIFIC THEORIES 


In the course of scientific inquiry, any 
current stock of facts and laws is always 
inadequate to a complete set of explana- 
tory relationships for the field. Some 
facts seem to hang together but others, 
on first inspection, appear quite unre- 
lated. To get all these facts and laws into 
some comprehensive pattern, the scien- 
tist makes some shrewd guesses—some 
hypotheses, which potentially link dis- 
junctive facts and laws into a consistent 
system of mutual implication. This in- 
clusive pattern of relationships, some hy- 
pothetical and some verified, constitutes 
a scientific theory. 

Like scientific laws, scientific theories 
have as their main function the explana- 
tion of the existential world for our bet- 
ter understanding and control. Explana- 
tion, as has been noted, consists in 
understanding how to proceed in solving 
a problem, or achieving the power to 
predict the determinate conclusion that
a specified course of action will lead to.
As this is the function of laws in par- 
ticular kinds of problems, so this is the 
function of theories in a much wider 
range of kinds of problems. For large- 
scale explanation, particular laws require 
inter-organization, and this is accom- 
plished through a scientific theory. 

In addition to being an explanatory 
system of known facts, a theory is also 
a very useful instrument in the prosecu- 
tion of further scientific inquiry. When 
the scientist faces an indeterminate or 
problematic situation in his field, the 
theory he has built and tested in this 
area serves as a guide in the examination 
of the material and the conditions of 
this situation. Examination in this broad 
perspective is especially important in the 
formulation of a clear-cut scientific prob- 
lem. Without the guidance of a theory, 
the analysis and discrimination which is 
necessary to convert an indeterminate
situation into a formulated problem is 
severely handicapped or, worse yet, fore- 
gone. As a consequence, the solutions 
proposed may not take sufficiently into 
account the full existential conditions 
to which the solutions are to be applied 
and in which they will take effect. Fur- 
thermore, new difficulties to a genuinely 
grounded solution are likely to be raised 
and old difficulties intensified. This point 
may be illustrated in psychological test- 
ing for certain human aptitudes, where 
it is often assumed that what aptitudes 
are is already sufficiently clear for scien- 
tific research. As a result, the market is 
flooded with tests (proposed solutions). 
Yet no one has a scientifically verifiable 
solution, and the difficulties of this ap- 
proach for a science seem to be multiply- 
ing rather than lessening. 

In view of the significance of a theory 
in scientific inquiry, some of the most 
important characteristics of a sound
theory deserve brief discussion. As im-
plied above, one measure of the worth of 
a theory is its inclusiveness of all the 
apparently pertinent facts and estab-
lished laws in the field. In the process of
formulating this inclusive pattern, two 
phases should be noted: (1) the scientist 
constructs hypothetical concepts and 
propositions beyond his actual knowl- 
edge, and (2) he selects these hypotheti- 
cal propositions with the aim of arriving 
at a total explanatory system in which 
his facts, existential generalizations, and 
hypothetical constructs are logically cor-
related and free from inter-contradic-
tions. Thus a theory includes not only
known facts and laws, but a set of basic 
postulates and a system of hypothetical 
propositions and concepts not yet veri- 
fied. Moreover, both the verified facts 
and laws and the hypothetical propo- 
sitions (in short, all the propositions and 
concepts in the theory) must be logically 
deducible from the postulates of the 
theory. 

The choice of postulates for a theory 
appears to be an arbitrary matter, and in 
the preliminary construction of a theory 
this is relatively true. But when alterna- 
tive sets of postulates are available, one 
set is as good as the other only if (1) 
the two sets are actually equivalent in 
the sense that nothing but identical hy- 
potheses for existential verification can 
be deduced from the two sets, even 
though the hypothetical forms and rep- 
resentations may be different; or (2) 
when the two sets are non-equivalent, the 
hypotheses deducible from either set 
have not been sufficiently verified to es- 
tablish definitely the existential falsity or 
inadequacy of either set. 

A further requirement of a sound 
theory is that it be fruitful of a large 
number of hypotheses which in principle 
may be used to help define and solve
other scientific problems in the field. In
an anticipatory sense, this is equivalent 
to saying that the theory is instrumental 
to predicting and explaining in advance 
other laws not yet verified. It must be 
remembered that these new laws de- 
ducible from the theory are at first only 
hypotheticals. Their logical consistency 
does not establish, of course, their exis- 
tential validity; these propositions form 
the hypotheses for further research. They 
are candidates for verification. And upon 
their verification largely depends the fu- 
ture tenability of that theory. 

Some of the hypotheses deducible from 
the theory will be candidates for verifi- 
cation as existential generalizations. 
Others will be of the type called hypo- 
thetical constructs, which are not in- 
tended to have direct existential refer- 
ence and which consequently will not be 
verified directly. The test of their validity 
is their unfailing usefulness in accurately 
defining problems in indeterminate situ- 
ations, providing operational controls 
which will resolve the problem into a 
fully determinate situation, and imply- 
ing further propositions and concepts 
whose purported existential reference 
can be and is verified. 

A final characteristic of an acceptable 
scientific theory is expressed in a com- 
mon convention called the Law of Parsi- 
mony. This means that the clarity, ap- 
plicability, and effectiveness of a theory 
are greatly aided if it is economical in
the number of non-existential postulates
and hypothetical constructions it de-
pends upon. The general rule is to use
the fewest and simplest hypotheticals 
which will adequately account for the 
known evidence. While generally useful, 
this rule is not inflexible nor too critical. 
Its justification is thoroughly pragmatic. 
A current example of the utility and ap- 
peal intrinsic in parsimonious simplicity 
is found in quantum mechanics, where 
the simplicity of mathematical form,
other things being equal, has determined
the accepted theory. In psychology, a 
corresponding illustration is the attempt 
through factor analysis to establish a few 
primary abilities to account for a large 
number of test performances. 

In summary, the major characteristics 
of an acceptable scientific theory may be 
listed as follows: 


1. It should include all the known 
facts and laws in the field of inquiry. 


2. All facts, laws, and hypotheses must 
be logically deducible from the postu- 
lates of the theory. 


3. It should predict and explain in 
advance possible laws not yet verified. 


4. These implied laws must be veri- 
fiable either directly (as existential gen- 
eralizations) or indirectly by their indis- 
pensable power to solve scientific prob- 
lems (as hypothetical constructs). 


5. The simplest theory is likely to be 
the best theory. 








CHAPTER III 


MEASUREMENT IN SCIENCE 


SEVERAL discussions of scientific meas-
urement in connection with psycho-
logical testing have appeared in recent 
years. B. O. Smith (44) offers a good treat- 
ment of the topic in his study of achieve- 
ment testing. An even more intensive 
analysis, based in considerable part on 
the writer’s work and original sources, 
has been published in a magazine article 
by Mark May (27; 42). Consequently, 
this chapter will present only such as- 
pects of the nature of measurement as 
will be needed for background and quick 
reference in the argument of the chapters 
to follow. 


THE NUMERICAL BASIS OF 
MEASUREMENT 


The importance of logic and especially 
mathematics to the processes of scientific 
inquiry was a point of major emphasis in 
the preceding chapter. In order to fit 
scientific propositions into the form re- 
quired for treatment by the logic of 
mathematics, numbers of course are in- 
dispensable. But numbers can perform 
any one of three kinds of functions, a 
fact which, operationally speaking, really 
gives us three kinds of numbers. The 
particular significance of this fact here 
is that only one of these kinds of num- 
bers is employed in fundamental meas- 
urement. 

First, numbers may be nominal—i.e., 
act as names of things. In this sense, 
numbers are used to designate football 
players, race horses, and prison convicts. 
In their use, there is no implication of
quantitative differences, either in terms
of units or of rank. These numbers, since
they stand for certain persons or things,
are not interchangeable. Nor do players
#4 and #5 combined equal players #2 
and #7. In fact, only by counting the 
number of numbers would one know 
how many players were on the team. 

Second, numbers may be ordinal—i.e., 
denote the places which objects occupy 
in some ordered series, such as from tall- 
est to shortest, or from hardest to softest. 
Ordinal numbers may be used when the 
property in question (e.g., “being tall” 
or “being hard”) varies between the per- 
sons or objects being ranked according 
to two essential relationships. The first 
is technically known as transitivity.
Stated in formal terms, it means that if
A has more of x than B, and if B has more
of x than C, then A must also have more of
x than C. Stated in concrete terms, it 
means that if Alan is taller than Bill, and 
if Bill is taller than Charles, then Alan 
must also be taller than Charles, and 
so on down through the group. But if we 
were to try to rank football teams in 
some well-balanced league in terms of 
the property, “can defeat this season,” 
transitivity would very probably not 
exist when the season was over, a fact 
which probably accounts for the choice 
of “percentage of games won” as the 
basis of ranking. 

The second relationship is known as 
asymmetry. Stated formally, it means 
that if A bears some relationship to B, 
then B does not bear that relationship 
to A. In short, the relationship flows just 
one way. An obvious illustration is that 
if A is taller than B, then B cannot be 
taller than A. It is this transitive-asym- 
metrical character of a property shared 
by a group which makes possible the
ranking of the members of the group in
a series of ordinal numbers. Ordinal
numbers imply quantitative differences,
but not differences expressed in equal
units. Moreover, the meaning of an or-
dinal rank of 7 for a person, say, in an
ordered series of “tallness” is simply that
he is shorter than #6 and taller than
#8. Considered in isolation from the
other members of this group, his rank of
7 has no quantitative meaning.
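
The two relationships required for ordinal numbers can be checked mechanically once the paired comparisons are recorded. The following is a minimal sketch under stated assumptions: the heights and the season's results are hypothetical, the function names are invented for the example, and every pair of members is assumed to have been compared.

```python
from itertools import permutations


def is_asymmetric(items, more_than):
    """If A bears the relation to B, B must not bear it to A."""
    return not any(more_than(a, b) and more_than(b, a)
                   for a, b in permutations(items, 2))


def is_transitive(items, more_than):
    """If A exceeds B and B exceeds C, then A must exceed C."""
    return all(more_than(a, c)
               for a, b, c in permutations(items, 3)
               if more_than(a, b) and more_than(b, c))


def ordinal_ranks(items, more_than):
    """Rank 1 = most. Refuses to rank unless both relationships hold."""
    if not (is_asymmetric(items, more_than) and is_transitive(items, more_than)):
        raise ValueError("property is not transitive and asymmetrical in this group")
    return {a: 1 + sum(more_than(b, a) for b in items if b != a) for a in items}


# Hypothetical heights in inches.
heights = {"Alan": 72, "Bill": 70, "Charles": 68}
boys = list(heights)
taller = lambda a, b: heights[a] > heights[b]
print(ordinal_ranks(boys, taller))

# Hypothetical results in a well-balanced league: each team beat one rival.
results = {("Reds", "Blues"), ("Blues", "Greens"), ("Greens", "Reds")}
teams = ["Reds", "Blues", "Greens"]
can_defeat = lambda a, b: (a, b) in results
print(is_asymmetric(teams, can_defeat), is_transitive(teams, can_defeat))
```

In the league example the relation flows one way for every pair, yet it forms a circle, so no ordinal series can be built on it.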

The third function of numbers is 
counting or enumeration. Numbers that 
count are commonly called, for accurate 
reference, cardinal numbers or numerals. 
When units, or things, or persons are 
counted, one ignores their distinctive in- 
dividuality, which nominal numbers 
identify, and one also ignores their rank 
in an ordered series, which ordinal num- 
bers denote. Since cardinal numbers 
refer to the interchangeable aspects of 
objects or properties, these objects or 
properties are capable of addition and 
subtraction. 

It is extremely important to note how 
data represented by these three kinds of 
numbers are correspondingly different. 
They are different because the opera- 
tions involved in scientifically describing 
data that fit the logical requirements of 
each kind are different. When an object 
can be distinguished from other objects, 
it can be named with a nominal num- 
ber, which tags that object’s individual- 
ity. When a group of distinguishable ob- 
jects can be perceived as varying in the 
muchness of some empirically identifi- 
able aspect and can thus be placed, by 
paired comparisons, in an ordered series 
from “most” to “least,” then those opera-
tions satisfy the requirements for use of
ordinal numbers. In the first case, the op-
erations establish that “4” means that ob-
ject and no other. In the second case, the
operations establish that “4” means some
particular member of the group which
stands between the third and fifth mem- 
bers with respect to muchness of a speci- 
fied property. 

In order to satisfy the logical require- 
ments for the use of cardinal numbers, 
the data in question must be subject to 
counting. This means that the members 
of the group must have some quality— 
at the very least, the quality of “being 
distinguishable objects”—in common so 
that any member of the group is actually 
interchangeable with any other member 
with respect to this quality. There are 
at least two critical tests of the adapta- 
bility of data to the use of cardinal num- 
bers. One is securing universal agree- 
ment that the objects are essentially 
identical with respect to the quality or 
property in question. The fundamental 
operation involved is the primitive act 
of discriminating between “sameness” 
and “difference” under specified con- 
ditions. The logical perfection theoreti- 
cally demanded by the use of cardinal 
numbers is never obtained exactly by the 
operation of direct discrimination, nor, 
for that matter, by any other physical 
operation. But the closer the data fit the
logical requirements, the more extensive 
can be the mathematical manipulation 
without appreciably altering the exis- 
tential truth of the results. So a second 
test of the adaptability of the data to 
cardinal numbers is always applied: 
when the cardinal numbers representing 
the quality in question are added or sub- 
tracted in any combination, equal sums 
of cardinal numbers must represent es- 
sentially equal amounts of the quality, as 
verified by existential results. When data
fit these tests, very extensive mathemati- 
cal treatment is possible and the range 
of possibilities for scientific inquiry is 
greatly increased. To take an example 
pertinent to psychological testing, the
familiar product-moment correlation
technique is an available logical tool for 
data expressible in cardinal numbers. 
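
The second test described above, that sums of the assigned numbers must agree with what is actually found when the objects are combined, may be sketched as follows. The weights, the observed combined weighings, the tolerance, and the function name are all hypothetical and serve only to illustrate the form of the test.

```python
def additivity_holds(assigned, combined_observations, tolerance):
    """Check that the sum of the numbers assigned to the parts matches
    the number observed for the combination, within a tolerance."""
    return all(
        abs(sum(assigned[name] for name in group) - observed) <= tolerance
        for group, observed in combined_observations
    )


# Hypothetical weighings of objects placed on the scale together.
observations = [
    (("a", "b"), 5.02),
    (("a", "c"), 6.98),
    (("a", "b", "c"), 10.01),
]

# Numbers obtained by fundamental measurement of each object alone.
weights = {"a": 2.0, "b": 3.0, "c": 5.0}

# Numbers that preserve only the rank order of the same objects.
rank_preserving = {"a": 2.0, "b": 3.0, "c": 4.0}

print(additivity_holds(weights, observations, tolerance=0.05))
print(additivity_holds(rank_preserving, observations, tolerance=0.05))
```

The second set of numbers orders the objects correctly and still fails the test, which is why correct ranking alone does not license treatment as cardinal numbers.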


BASIC CONDITIONS OF MEASUREMENT 


It was observed in the first chapter 
that the term measurement is applied in 
educational and psychological testing to 
almost every conceivable use of numbers. 
Even in psychological testing for the 
purpose of strict scientific research, the 
term is unstandardized. This unfortu- 
nate practice is a source of confusion 
and misunderstanding to persons both in 
and outside the field of psychology, and 
it is significant that in the more mature 
science of physics each operationally dif- 
ferent method of quantifying data is 
given a distinguishing name. The quali- 
fications of the term measurement to be 
used here will follow the practice in 
the physical sciences just because differ- 
ent sets of operations need to be accorded 
a different name. This will mean, not 
that the non-measurement methods of
quantifying scientific data in psychologi- 
cal testing will be considered less scien- 
tific or less respectable, but that they will 
be treated as operationally different and 
subject to different logical and mathe- 
matical manipulation. 

1. Any method of quantifying data, 
whether it use ordinal or cardinal num- 
bers and whether it be called measure- 
ment or not, must deal with qualities or 
properties which are both asymmetrical 
and transitive in the group under con- 
sideration. However, in the discussion to 
follow, only those methods which em- 
ploy cardinal numbers in the quantifi- 
cation of data (with the exception of 
simple enumeration, to be discussed 
later) will receive the term measurement. 

2. A further point which is basic to 
any method of quantification is the na- 
ture of the quality or property being
quantified. In accord with the logic of
scientific inquiry discussed in the pre- 
ceding chapter, the quality in question is 
not to be thought of as exclusively a 
property of either the objects being 
quantified or of the instrument of quan- 
tification. Rather the quality is an ex- 
istential relationship between the objects 
and the instrument. This statement is 
obviously true in the case of measuring 
length, weight, or volts. It may not be so 
clear in the case of measuring tempera- 
ture, because the relationship between 
heat (or molecular movement) and the 
degrees on a thermometer is not directly 
verifiable and hence hypothetical. How- 
ever, the difficulty is overcome when it 
is seen that the relationship between, 
say, the boiling point of water and the 
height of a column of mercury in a 
standard tube is a verifiable, existential 
relationship. And since there is a series 
of such constant existential relationships 
with a tube of mercury, the hypothesis 
of indirect measurement of temperature 
through equally graduated degrees is 
sustained as scientifically worthy and 
useful, though still not an existential 
fact. 
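
The hypothesis of “equally graduated degrees” amounts to interpolating in a straight line between verified fixed points. A minimal sketch follows, assuming hypothetical column heights for the freezing and boiling points of water and the Fahrenheit values 32 and 212; the function and parameter names are invented for the example. The linearity built into the function is precisely the hypothetical element, not an observed fact.

```python
def degrees_from_height(height_mm, freeze_mm, boil_mm,
                        freeze_deg=32.0, boil_deg=212.0):
    """Convert a mercury column height to degrees by equal graduation
    between two fixed points (assumes expansion is linear between them)."""
    fraction = (height_mm - freeze_mm) / (boil_mm - freeze_mm)
    return freeze_deg + fraction * (boil_deg - freeze_deg)


# Hypothetical column heights, in millimeters, at the two fixed points.
FREEZE_MM = 40.0
BOIL_MM = 220.0

for height in (40.0, 130.0, 220.0):
    print(height, degrees_from_height(height, FREEZE_MM, BOIL_MM))
```

Only the two fixed points are existential relationships verified directly; every intermediate reading rests on the assumed graduation.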

Since the quality being quantified is 
an existential relationship, it is subject 
to the conventional scientific require-
ments of existential propositions. This
means, in brief, that the quality or prop- 
erty proposed for quantification should 
be (1) empirically known, (2) known ob- 
jectively—i.e., capable of full intersub- 
jective knowledge, (3) defined by physical 
operations (very often the measuring in- 
strument constitutes the operational defi- 
nition), and (4) verified as existential 
under conditions permitting of falsifica-
tion.

For purposes of quantification, the 
existential relationship under considera- 
tion is subject to still further require-
ments in scientific use. First, the rela-
tionship must be capable of quantitative 
variation without undergoing qualita- 
tive change itself. This is not a matter to 
be assumed but a matter to be experi- 
mentally tested and verified. For ex- 
ample, the operational definition of 
weight can be objectively verified as con- 
stant for a wide range of quantitative 
variations, but political conservatism has 
so little objectivity at present and is 
usually so poorly defined that one may 
seriously doubt whether the judgments 
of several persons on more or less con- 
servatism are actually referring to the 
same quality. Second, the relationship 
must be shared by all members of the 
group, the only differences between 
members (as reported by the measuring 
instrument) being variations in the 
amount of the relationship expressed. 
That is, the only differences reported be-
tween objects by a pair of scales are
differences in the amount of weight. 
Again this is a matter capable of and 
requiring experimental verification. 

In addition to these common condi- 
tions of all quantification, the various 
types of measurement, which use cardinal 
numbers, have several further character- 
istics. For one, the quality or property in 
question must be capable of expression 
in linear or geometrical extension. This 
condition, of course, limits the applica- 
tion of measurement, as strictly defined, 
to a relatively small number of proper- 
ties. Thus the scientific study of other 
kinds of properties requires other tech- 
niques than strict measurement. More- 
over, before a proposed measuring de- 
vice can be validly used, the linear or 
geometric description of the property in
question must be verified as an actual 
existential relationship. For another 
characteristic, the quality or property in
question must be capable of expression
in equal units, an obvious requirement
for the use of cardinal numbers. To ful-
fill the requirements of logical treatment 
by mathematics, the equality of the units 
is established, not by definition, but by 
confirming the proposition that the re- 
sults are existentially true when the units 
of the property are added and sub-
tracted.

Beyond these two characteristics of 
scientific measurement, there remain 
only the highly technical logical postu- 
lates of measurement, which are not es- 
sential to the later criticism in this study 
and which may easily be found ade- 
quately discussed elsewhere (27; 44, ch. 
4; 32, 315; 20, 345-46; 7, 117). In view 
of the series of common conditions de- 
scribed above, it is now appropriate to 
differentiate the various kinds of quan- 
tification. 


FUNDAMENTAL MEASUREMENT 


The most elementary form of meas- 
urement is the direct comparison of two 
objects, one of which is taken as the 
standard. In fundamental measurement 
in the physical sciences, essentially only 
one refinement is made of this procedure. 
Instead of finding the standard among 
the “given” objects, the scientist care- 
fully constructs and calibrates his own 
object for the special purpose of measur- 
ing. The chief characteristics of funda- 
mental measurement are as follows: (1) 
the measuring device and the objects 
measured have a property in common; 
(2) this property varies only quantita- 
tively between the two; (3) the measur- 
ing is accomplished by the physical op- 
eration of applying the former to the 
latter, and (4) the differences in quan- 
tity found among the objects satisfy by 
existential verification the axioms of ad- 
dition. Examples of properties capable 
of fundamental measurement include 
length, weight, duration, volts, and the 
like. 


DERIVED MEASUREMENT 


When two or more cases of funda- 
mental measurement are combined in 
ratios to express a new existential re- 
lationship, the result is known as derived 
or surrogate measurement. The measure- 
ment of density is a typical example. 
Density is the numerical value derived 
from the ratio of the weight to the volume 
of a substance. Once the units of volume 
and of weight are set, the densities of all 
substances maintain an exact, constant 
numerical relationship to each other. 
Although this type of measurement does 
not exhibit all the characteristics of fun- 
damental measurement, it expresses nu- 
merical constants which are directly de- 
rived from measures that do satisfy these 
requirements. 
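
The point about constant relationships can be illustrated with a minimal sketch. The substances, figures, and function names below are assumptions introduced for illustration and are not taken from the text. The sketch suggests that changing the units of weight and volume changes every density value, while the ratio between any two densities stays the same.

```python
# Hypothetical samples: (weight in grams, volume in cubic centimetres).
samples = {"iron": (787.0, 100.0), "water": (100.0, 100.0)}

GRAMS_PER_OUNCE = 28.349523125
CC_PER_CUBIC_INCH = 16.387064


def density(weight, volume):
    """Derived measure: ratio of a weight measure to a volume measure."""
    return weight / volume


def densities(samples, weight_factor=1.0, volume_factor=1.0):
    """Compute densities after converting both fundamental measures to other units."""
    return {
        name: density(w / weight_factor, v / volume_factor)
        for name, (w, v) in samples.items()
    }


metric = densities(samples)  # grams per cubic centimetre
imperial = densities(samples, GRAMS_PER_OUNCE, CC_PER_CUBIC_INCH)  # ounces per cubic inch

# The individual values differ with the units chosen, but their ratio does not.
ratio_metric = metric["iron"] / metric["water"]
ratio_imperial = imperial["iron"] / imperial["water"]
```

Under these assumed figures the two ratios agree to within floating-point error, which is the sense in which the derived values express a numerical constant rather than an artifact of the units.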

In most cases of either of these two 
types of measurement, a scientific law is 
established with the achievement of 
measurement. The reason is that the 
measuring instrument is so constructed 
as to verify or falsify a generalized re- 
lationship among the data being exam- 
ined. The size of the unit of measure- 
ment in these cases is arbitrarily 
established in terms of the quantitative 
ratios expressed by the verified scientific 
law in each case. Undoubtedly the great 
prestige of measurement in the physical 
sciences is largely due to its instrumental 
function in establishing numerical laws, 
which are susceptible to extensive mathe- 
matical transformation. 


INDIRECT MEASUREMENT 


This form of measurement is appro- 
priate to the determination of certain 
intensities. Variations in intensity are ex- 
perienced directly but they cannot be 
measured directly. Consequently, they 


are related, when possible, to some sub- 
stitute property which can be measured 
directly. Degrees of temperature, for ex- 
ample, are an expression of indirect 
measurement. We can experience varia- 
tions in temperature directly but not 
degrees of temperature. The substitute 
we commonly use is mercury in a hollow 
tube, for the linear expansions and con- 
tractions of mercury can be directly 
measured. Although evenly graduated 
degrees are marked off on the mercury 
tube, we have no direct assurance that 
temperature changes occur in the same 
linear fashion as the movement of mer- 
cury. Our only assurance is the con- 
stancy of the melting or boiling points of 
many substances on the column of mer- 
cury. The actual measurement of tem- 
perature is thus not an existential fact 
but an extremely useful and productive 
hypothetical construct. In other words, 
the measurement of the expansion of 
mercury in the tube is direct and exis- 
tentially true; the measurement of ex- 
perienced variations in temperature is 
indirect and hypothetical. 


RANKING 


When the subject matter of scientific 
experimentation is not capable of ac- 
curate expression in geometric form, 
some other means of quantification be- 
sides strict measurement has to be found. 
One of these is the ranking of data from 
most to least. In physics, this method is 
commonly used in quantifying the hard- 
ness of substances. By the device of find- 
ing what substances will scratch other 
substances and not be scratched thereby 
themselves, a rank order with respect to 
the common property of hardness can be 
established. Since ordinal rather than 
cardinal numbers are used in ranking, a 
different and considerably more re- 
stricted use of the logic of mathematics 
is possible. Since the aim of science is 
the establishment of verified existential 
relationships, the careful scientist em- 
ploys only those logical forms which fit 
and accurately transform his data. 
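
As a rough illustration of ranking, the following sketch builds a rank order from pairwise scratch tests. The list of substances, the recorded outcomes, and the function name are hypothetical and serve only to show the procedure. The sketch assumes that every pair has been tested and that the outcomes are mutually consistent.

```python
from itertools import combinations

# Hypothetical outcomes: (a, b) means a scratches b and is not scratched by b.
scratches = {
    ("diamond", "quartz"), ("diamond", "calcite"), ("diamond", "talc"),
    ("quartz", "calcite"), ("quartz", "talc"),
    ("calcite", "talc"),
}

substances = sorted({name for pair in scratches for name in pair})


def rank_by_hardness(substances, scratches):
    """Order substances from hardest to softest by counting how many others each scratches."""
    wins = {name: 0 for name in substances}
    for a, b in combinations(substances, 2):
        if (a, b) in scratches:
            wins[a] += 1
        elif (b, a) in scratches:
            wins[b] += 1
        else:
            raise ValueError(f"no scratch test recorded for {a} and {b}")
    return sorted(substances, key=lambda name: wins[name], reverse=True)


ranking = rank_by_hardness(substances, scratches)

# Ordinal positions only: 1 is hardest. The gaps between positions carry no meaning.
ordinal = {name: position + 1 for position, name in enumerate(ranking)}
```

The resulting numbers support statements such as "harder than" but not "twice as hard"; adding or averaging them would go beyond what the scratch operation establishes.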


ENUMERATION 


The enumeration of data involves the 
use of cardinal numbers but it is not, 
strictly speaking, measurement. For pur- 
poses of clarity in this discussion, when 
we mention counting the units on an es- 
tablished measuring device, we shall use 
the term measurement, but when we 
mention counting anything else, we shall 
use the term enumeration. This distinc- 
tion for purposes of clarity does not 
deny, of course, that the results of enu- 
meration are capable of the same exten- 
sive transformation by the logic of 
mathematics as any data expressed in 
cardinal numbers. The importance of 
making the distinction lies in the dif- 
ference in scientific meaning between 
“how many” and “how much.” Simple 
enumeration tells “how many,” and the 
only property that objects being enumer- 
ated need have is “being distinguishable” 
from other objects of a class or kind. The 
counting involved in measurement tells 
“how much,” and the quality or property 
being thus measured is subject to many 


rigorous scientific controls, most of which 


have been discussed above. Since to treat 
“how many” as equivalent to “how 
much” might easily result in false gen- 
eralizations-by-analogy, it is well to re- 
member that concepts in science which 
involve different operations are by that 
token scientifically different. 


QUANTITATIVE JUDGMENTS 


By far the most common form of quan- 
tification is the direct judgment. When 
we say that this is bluer, lovelier, fun- 
nier, or smarter than that, we are making 


quantitative judgments. In making such 
judgments about an object, we may base 
our comparison in part on some similar 
object in the situation or on an immedi- 
ate memory of something similar, but an 
essential element in the judgment (es- 
pecially if it is to be an accurate or sound 
judgment) is our background of experi- 
ences with such objects. And herein lies 
the primary operational difference be- 
tween measurement and quantitative 
judgment. In measurement we have at 
hand an objective, calibrated instrument, 
and the only critical background re- 
quired is the power to discriminate such 
qualities as sameness, difference, and in- 
between-ness. The report of “how much” 
is the function of the measuring device. 
But in a dependable quantitative judg- 
ment, a great deal depends on the rich- 
ness of the judge’s experience with ob- 
jects of this kind, and usually the wider 
and more profound that experience, the 
more reliable and accurate is the judg- 
ment. In carefully executed measure- 
ment, the susceptibility of the results to 
arithmetical treatment is commonly as- 
sured without further need to verify the 
products of this arithmetical treat- 
ment as existentially true. In careful 
quantitative judgments, no such assur- 
ance is had, although the numerical esti- 
mate of several quantitative judgments 
may be averaged to find the most likely 
hypothesis to work on. 

An interesting relationship between 
quantitative judgments and genuine 
measurement is to be found in studies of 
sensory discrimination in psycho-physics. 
In measurement in the physical sciences, 
the aim is to eliminate variations be- 
tween observers and to record only varia- 
tions in the data under experimentation. 
But in these experiments in psycho- 
physics, the aim is to hold the measured 
data constant and to observe the quanti- 
tative differences in the sensory discrimi- 
nation or judgment of persons. Thus, by 
reversing the customary process in using 
a measuring instrument, a kind of in- 
direct measurement of sense discrimina- 
tion is possible. 

In these experiments, of course, the 
operation performed by the subject in 
distinguishing perceptible differences in, 
say, weight or speed is not measurement; 
but the operation of weighing the objects 
on scales or timing their rate of move- 
ment is measurement. Moreover, one is 
not exact and the other only approxi- 
mate measurement. Even the true meas- 
urement operation is only approximate. 
The difference, as has been intimated 
above, is a difference in the kinds of 
operations. 

Since this point is quite critical in 
psychological testing, it will be well to 
consider a further illustration. It has 
been found in some experiments that an 
agreement of 75% or more among judges 
on two objects being of equal weight, 
or of one object being twice as heavy 
as another, actually approximates very 
closely the difference found by independ- 
ent measurement. Now these judgments 
are not a true substitute for measure- 
ment because they depend upon inde- 
pendent measurement for proof of their 
validity. It would be merely arguing by 
analogy to claim that the use of 75% 
agreement among pooled judgments 
would be equivalent to measurement in 
fields not now capable of fundamental 
measurement just because this agree- 
ment has been found to hold in fields 
which are subject to fundamental meas- 
urement. For in the judgment of such 
qualities as radicalism, beauty, or per- 
sonal adjustment, where the operations 
of fundamental measurement are ap- 
parently inapplicable, there is not only 
no independent verifying check on the 


exactness of the quantitative differences 
seen by a group of judges, but also the 
very nature and function of these par- 
ticular qualities in human experience 
may be seriously misconceived and dis- 
torted. In the case of these qualities, or- 
dinal numbers and the method of rank- 
ing by paired comparisons are much 
more likely to fit the data than are 
cardinal numbers and measurement. 

The importance of distinguishing be- 
tween different kinds of measurement 
and other forms of quantification is well 
worth summarizing at this point. While 
in everyday usage the term measurement 
is synonymous with almost any form of 
numerical expression, there are at least 
three reasons why this practice is not 
permissible in scientific discourse. The 
first rests on the fact that various types 
of quantification are extremely impor- 
tant elements in the operational defini- 
tion of the existential propositions of 
science. One of the chief principles in 
the methods of modern science is strict 
observance of thorough-going opera- 
tional definition (3, ch. I). Second, of all 
the operational definitions of science, the 
most thoroughly unambiguous form is 
that of measurement in particular and 
the use of numbers in general (32). Not 
to discriminate and qualify the kind of 
numerical quantification being used 
would be to make ambiguous what is 
now the most unambiguous of forms of 
scientific expression. And third, the kind 
of numerical quantifications being used 
is, of course, directly related to the kind 
of logical treatment and mathematical 
transformation which is reliably appro- 
priate. In essence, the appropriate kind 
of mathematical treatment is that kind 
whose logical postulates most accurately 
fit the existential data. No competent 
person would add, for example, the num- 
ber of apples, oranges, and bananas he 
possessed and then divide by three to 
find the average number of each he has. 
Nor would one expect an investigator to 
take the shifts in percentile rank which 
occurred among the members of a group 
over a period of time, and average them 
for a meaningful scientific result. And 
yet this latter example has occurred in 
supposedly reputable psychological re- 
search, indicating a serious misunder- 
standing of the logical postulates under- 
lying the several kinds of numbers. 

In the interests of sound scientific in- 
quiry, it is important not only to observe 


the logical conditions of the use of 
numbers but also to place mathematical 
treatment in the exclusively instrumental 
position of producing, from the original 
data of the problem, hypotheses capable 
of existential verification. Consequently, 
to avoid the possibility of using the logic 
of mathematics for self-delusion, the in- 
vestigator should not attempt to force his 
data into a preferred pattern of logical 
forms, but should attempt to find the 
particular logical pattern which will 
most accurately fit his data. 





CHAPTER IV 


MENTAL TESTS AS MEASURING INSTRUMENTS 


MENTAL testing is, of course, intended 
to yield quantitative data about 
human behavior. However, as the pre- 
ceding chapter has indicated, that quan- 
tification may be achieved by several types 
of methods, usually depending on the 
nature of the data and the techniques 
available. Which method of quantifica- 
tion is employed makes considerable dif- 
ference in the meanings and uses that are 
appropriate to the test results. It is un- 
fortunately common practice to refer to 
all forms of mental testing as “measure- 
ment,” but in the succeeding discussion 
we shall follow the stricter terminology 
developed in the previous chapter and 
attempt to find out, by critical examina- 
tion, what kind of quantification is ac- 
tually employed by certain kinds of tests. 
Some tests are not built or used for the 
purpose of contributing to a science of 
psychology. Others appear to be devoted 
to that purpose but misconstrue or vio- 
late the principles of the kind of quanti- 
fication they are claimed to represent. 
In this chapter, the concern will be with 
(1) the kind of quantification employed 
(2) by those tests which seek to establish 
some basic facts, laws, or generalizations 
in a science of psychology. 

Modern physical science has taken 
over three hundred years to clarify and 
reformulate its concepts. Experimental 
psychology, a dubiously grateful heir to 
a mass of ethical and epistemological 
concepts, has had scarcely forty years. 
That it has made some progress is re- 
flected in the old bromide that psy- 
chology first lost its soul and then lost 
its mind. Some exponents of behaviorism 





are now advocating that it next lose con- 
sciousness, and from other quarters 
comes word that it may also lose its 
“general intelligence.” By and large, 
these “losses” represent scientific gains. 
They mean the abandonment of certain 
verbalisms, charged with superstitions 
and unverifiable connotations, in favor 
of more neutral concepts which can be 
operationally defined according to scien- 
tific methodology. But there is still much 
work to be done. Many concepts of abili- 
ties and traits still base much of their 
standing on an appeal to the most naive 
form of “common sense,” and certain 
cases of their definition are not only con- 
fusing but differ radically among them- 
selves. 

The mental testing movement to date 
has been chiefly concerned with these 
most nearly “common sense” aspects of 
psychology. That is, it deals extensively 
with such concepts as “intelligence,” 
“honesty,” and “arithmetic ability,” 
which have a familiar ring to the lay- 
man’s ear. Other non-test experimental 
approaches, which have produced con- 
cepts like “the conditioned reflex,” “S-R 
bonds,” “valences and vectors,” and “de- 
fense mechanisms,” probably do not 
enjoy so much popular understanding— 
and misunderstanding. The obvious rea- 
son for this is that the great majority of 
concepts dealt with by mental tests has 
been taken directly from the common 
usages of our culture. Consequently, as 
most test builders are aware, there is con- 
siderable danger that the concepts to be 
measured by tests will not have the con- 
stant, unambiguous meaning required 
for scientific treatment. 

The great bulk of group tests are de- 
signed to quantify abilities and person- 
ality traits. And of these, the overwhelm- 
ing majority are distinguished by a char- 
acteristic which justifies placing them in 
a single category. This characteristic is 
that each test consists of a series of tasks- 
to-be-accomplished. Accomplishing tasks 
is certainly one of the most obvious, ob- 
jective indices of the worth of one’s men- 
tal equipment, and is admirably suited 
to testing. Tests in this category include 
the best examples of those which attempt 
the most rigorous quantification of be- 
havior data,* and hence they will be 
treated here as representative of the 
problem of test measurement for scien- 
tific purposes. 


DEFINING THE PROPERTY FOR 
QUANTIFICATION 


The first common condition of all 
quantification is an operational defini- 
tion of the existential relationship or 
property which is to be the subject mat- 
ter of the quantifying process. The satis- 
faction of this condition precedes the 
construction or selection of a quantifying 
device, and in some psychological testing 
it is undoubtedly the real (but often un- 
suspected) source of the invalidity and 
unreliability of the test results. In deal- 
ing with such concepts as “literary abil- 
ity,” “social leadership,” or “emotional 
balance,” we find our present knowledge 
severely limited as to whether it is one or 
several things, and what is the nature 
of its extension. We know something 
about how to develop, for example, 
literary expression, and we can recognize 


* The outstanding exception is Thurstone’s 
technique of “attitude measurement,” but since 
no direct contribution to the science of psychology 
is claimed for this treatment of attitudes, it can 
safely be omitted from consideration here. 




differences in literary talent between per- 
sons, but no one yet has succeeded in 
defining its ingredients or dimensions in 
a manner capable of scientific verifica- 
tion, and hence it is not yet susceptible 
to measurement. This state of affairs is 
by no means limited to problems of psy- 
chological interest. The medical profes- 
sion, as Wechsler (64, 19) points out, is 
unable to measure “susceptibility to dis- 
ease” for the same reasons. 

Granting that an “ability” is the gen- 
eral object of a test, just what is the 
particular property to be quantified and 
where is it located? Virtually all investi- 
gators agree that an ability is not meas- 
ured directly by a psychological test. By 
this, they probably mean that a psycho- 
logical test does not directly describe the 
presumed neurological nature or organic 
capacity of the ability, hypothetical as 
these may be. Such an interpretation 
rules out the possibility that testing is 
either fundamental or derived measure- 
ment, both of which depend on the di- 
rect application of the measuring device 
to the property in question. There re- 
mains the possibility of indirect measure- 
ment of the ability through direct 
measurement of the performance it pro- 
duces, as temperature is indirectly meas- 
ured by the direct measurement of the 
expansion of mercury. On the other 
hand, if ability be defined as a complex 
relationship between a purposing indi- 
vidual and a problematical situation (in 
this case, the test), then the test par- 
ticipates directly in that relationship and 
some form of direct quantification ap- 
pears possible. Different psychological 
assumptions concerning the nature of 
an ability thus lead to diverse conclu- 
sions as to whether testing is direct or 
indirect measurement, and they will be 
taken up in the following chapter. But 
no matter whether performance be con- 
sidered as identical with ability or only 
correlated with ability, it is still the test 
performance which is to be subjected 
to direct measurement. Consequently, 
the test performance constitutes the lo- 
cation of the property or properties to 
be described in measurement. 

Careful examination reveals that there 
is not one but several properties or quali- 
ties of significance in a test performance 
—i.e., whether it is arithmetic perform- 
ance, intellectual performance, or read- 
ing performance. This quality of per- 
formance often suffers the scientific 
handicap of not only being culture- 
bound but even having a considerable 
variety of meanings to persons within 
the same culture. It seldom has a com- 
monly accepted operational definition, as 
the many different kinds of tests bearing 
the same name eloquently testify. Per- 
haps the most notable attempt to avoid 
this difficulty is Thorndike’s use of the 
concept “CAVD intellect” (55) to de- 
scribe the character of the performance 
elicited by his test. For the most part, 
other tests attempt to put the cart before 
the horse by using fairly high correla- 
tion coefficients between similarly named 
tests to show that different operational 
definitions (in which each test represents 
an operational definition) really describe 
the same thing. But a high correlation 
does not necessarily establish equiva- 
lence*—the correlation might just as well 


hypothecate constant association of two 
factors or a causal relationship between 
the two. Only controlled verification can 
establish which might be the case, and 
none of these abilities appears to be 
sufficiently understood yet to make such 
a test possible. The nearest approach so 
far to experimental verification of the 
scientific meanings of inter-correlations 


* For a complete discussion of alternatives, see 
W. S. Monroe and D. B. Struit (go). 


is being pursued by certain factor ana- 
lysts. 

Another recourse sometimes taken to 
circumvent the need of unambiguously 
identifying the property being measured 
is to say that the test measures whatever 
it does measure, an obvious case of beg- 
ging the question and of reversing the 
logical order of scientific inquiry in that 
the result is a question instead of a solu- 
tion. A more feasible procedure than 
this would be to insure experimentally 
that the performances elicited by the 
test are genuinely representative of a 
kind of human behavior which can be 
unambiguously distinguished from all 
other kinds of human behavior. Thorn- 
dike’s serious attempt to make intellect 
CAVD a concept with scientific standing 
will be considered more fully in the next 
chapter. 

A second significant property or qual- 
ity in a test performance is its accuracy. 
Since only the right answers are counted 
in most cases, it is sometimes forgotten 
that the wrong answers are also a part 
of the test performance. Test builders 
take great pains with the quality of ac- 
curacy, and there is seldom any question 
concerning objectivity and common 
agreement on this quality. It must be 
remembered, of course, that test items 
dealing with social and moral judgments 
accepted as sound only in our society, 
instead of with presumably universal 
facts, would obviously be ambiguous if 
not invalid in another social group. It is 
important to notice also that when this 
property is subjected to quantification, it 
is almost always accuracy which is quan- 
tified, not inaccuracy, although the latter 
is undeniably an existential aspect of 
the test performance. 

Two more significant properties of the 
test performance are the amount of work 
done and the speed with which it was 
done, sometimes expressed as a single 
property rate. The amount of work done 
is almost always expressed as the number 
of test items completed correctly, and 
hence accuracy and amount are ex- 
pressed as functions of each other. The 
speed with which the work is done is of 
course expressed by time duration, and 
is independently measured with a chro- 
nometer. 

While we have specified several sig- 
nificant aspects of a test performance— 
its character, its accuracy, and its rate— 
all of these are dependent upon a basic 
property which ultimately determines 
the worth or goodness of the perform- 
ance. This property is usually designated 
as the difficulty overcome by the per- 
formance. Only a brief scrutiny is re- 
quired to show the dependence of the 
three other aspects of the test perform- 
ance on this fundamental concept. First 
of all, the term never means absolute 
difficulty, but that kind of difficulty rep- 
resented by the character of the test, 
such as “reading difficulty” or “CAVD 
intellectual difficulty.” Second, the prob- 
lem of achieving correct answers in a test 
performance is essentially a problem of 
overcoming whatever difficulty stands in 
the way of making a correct response. And 
finally, the rate of performance is also a 
function of the difficulty of the tasks. If 
the time permitted is held constant, the 
number of tasks accomplished will vary 
with the level of difficulty. If the time per- 
mitted is severely restricted, time itself 
becomes an element of difficulty. If the 
time permitted is extended far enough, it 
ceases to have any effect on the perform- 
ance and only difficulty is the influen- 
tial factor. Consequently, the key problem 
in quantifying test performance is essen- 
tially the problem of quantifying the com- 
plex property of difficulty. In applying the 
further conditions of scientific quantifi- 


cation, we shall concentrate on this criti- 
cal property of test performance. 


DEFINITIONS OF DIFFICULTY 


Two distinct manifestations of diffi- 
culty need to be differentiated. One is 
difficulty for the individual and the other 
is difficulty for the group. These two 
conceptions may or may not be equiva- 
lent, depending on the operations by 
which each is defined and the verification 
of their equivalence. 

The difficulty of a series of tasks for 
an individual could conceivably be de- 
fined by physical operations other than 
his test performance. For example, the 
operations of measuring the electrical 
activity of the brain during work on a 
series of tasks might be made sufficiently 
sensitive to reveal, say, degrees of in- 
tensity in the concentration required. 
But this is only a remote possibility and 
has not yet been accomplished. More- 
over, since this discussion is concerned 
with mental tests as scientific instru- 
ments, only definitions of difficulty ap- 
propriate to the operations of testing 
need be considered. 

A possible operational definition of 
individual difficulty in a series of test 
items would be to record the amount of 
time the testee spent on each item. This 
method of definition, however, is open to 
several practical objections. First, it 
would be appropriate only to those test 
items which were so similar in content 
as to require but a single set of direc- 
tions (as a series of tasks in addition or 
subtraction) or which had different sets 
of directions equated in some way. It 
would also require careful control of the 
testing conditions so that all factors 
which the investigator wished to exclude 
from the difficulty of the tasks would 
actually be eliminated. Further objec- 
tion (54, 230) has been raised to it be- 
cause all testees would have to pass each 
test item perfectly or equally imperfectly 
in order for the time units to be strictly 
comparable. To the writer’s knowledge, 
this method of defining difficulty is not 
employed in any currently published 
tests of abilities and aptitudes. 

Another possible operational defini- 
tion of individual difficulty would con- 
sist in determining the percentage of 
successes among repeated trials by the 
same person of the same test item, or the 
same group of test items.* The more diffi- 
cult item (or group of items) would be 
the one on which he succeeded least 
often, and the ratio of successes to fail- 
ures over a series would provide a scale 
of difficulty. Because of the enormous 
labor involved in carrying out these op- 
erations, to say nothing of controlling 
such influences as the practice effect, this 
definition is seldom if ever used in cur- 
rent testing. In fact, virtually all mental 
tests are based on a definition of group- 
difficulty, and the problem of making 
such a definition have scientific meaning 
for the difficulty of an individual’s per- 
formance is attacked, if at all, in later 
operations. 

In the definition of difficulty for the 
group, there are two distinct sets of op- 
erations employed in common practice. 
The first employs essentially physical 
operations and is designed to indicate 
the group difficulty of each test item. 
The second is primarily a hypothetical 
construct, depending essentially on logi- 
cal operations, and is designed to indi- 
cate the group-difficulty of the test as a 
whole. In view of the great significance 
of these two definitions for current men- 
tal testing, both will bear close examina- 
tion. 

The first operational definition of 


* Suggested by Thorndike (55, 27), but rejected 
by him as unfeasible. 


group difficulty, based on determining 
the difficulty for the group of each test 
item, commonly employs the following 
procedure. A large group of some speci- 
fied homogeneity and usually considered 
representative of some larger group of 
which it is a sample, is given a series of 
test items. The smaller the percent of 
the group which gives the right or de- 
sired response on any of the items (con- 
trolling, of course, such extraneous fac- 
tors as ambiguity or reconditeness in the 
items), the more difficult that item is con- 
sidered to be for that group. Thus, an 
item passed by 80 percent of the group 
would have only 20 percent difficulty, 
while a more difficult item passed by only 
20 percent of the group would have 80 
percent difficulty. This method of defi- 
nition is reputed to be generally charac- 
teristic of all tests using the difficulty 
concept (16, 410). However, later dis- 
cussion in this chapter will indicate that 
this definition is not strictly adhered to 
in most cases, largely because of the tech- 
nical difficulties involved, and that pref- 
erence is given to the operational con- 
cept to be described next. 
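
The item-by-item definition can be sketched briefly. The response data and the function name below are hypothetical, chosen only so that the figures match the 80 and 20 percent example in the preceding paragraph. The sketch assumes that extraneous factors such as ambiguity have already been controlled.

```python
def item_difficulty(responses):
    """Percent difficulty of each item: the percent of the group NOT giving the right response.

    `responses` holds one list per testee, with True marking a right answer on each item.
    """
    n_persons = len(responses)
    n_items = len(responses[0])
    difficulties = []
    for item in range(n_items):
        passed = sum(1 for person in responses if person[item])
        difficulties.append(100.0 * (n_persons - passed) / n_persons)
    return difficulties


# Hypothetical group of five testees answering two items.
responses = [
    [True, False],
    [True, False],
    [True, False],
    [True, True],
    [False, False],
]

# Item 1 is passed by 80 percent of the group, item 2 by only 20 percent.
difficulty = item_difficulty(responses)
```

Note that the figure obtained belongs to the item as taken by this particular group; a different group could yield a different percentage for the same item.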

The other definition of group diffi- 
culty, based on determining the difficulty 
of the test as a whole, does not deal with 
the performance of the group on indi- 
vidual test items, but depends on some 
distribution of the scores made by the 
members of the group. Test items as 
such lose whatever individuality they 
might have as particular amounts of dif- 
ficulty, and each one that is answered 
correctly is viewed as interchangeable 
with any other answered correctly. On 
these grounds, the correctly answered 
items of each person in the group may 
be enumerated, and the sum of these 
items becomes his raw score of perform- 
ance. The distribution of the perform- 
ances of the members of the group will 
vary from the lowest raw score to the 
highest raw score. So far the terms “low- 
est” and “highest” mean only the num- 
ber of different items passed, but to 
many persons it seems reasonable to as- 
sume that achieving the highest score 
means overcoming more difficulty than 
achieving the lowest score. On this as- 
sumption, the range of differences be- 
tween the raw scores is viewed as cor- 
responding roughly to differences in dif- 
ficulty overcome, and some method is 
used to convert the raw scores into de- 
rived scores representing amounts of dif- 
ficulty. The scientific meaning of these 
derived scores will be discussed in a later 
section. The important point here is that 
a certain assumption regarding the distri- 
bution of difficulty in raw scores has 
been used to achieve a definition of 
group difficulty through logical (not 
physical) operations. Whether this defi- 
nition is a true representation of an 
existential fact remains a matter to be 
verified experimentally. 


IS DIFFICULTY A HOMOGENEOUS 
PROPERTY? 


One of the elementary conditions of 
all quantification is that the property in 
question be homogeneous throughout 
its range of expression. Stated another 
way, this means that the property must 
be unvarying in quality from low to 
high amounts so that all differences re- 
corded, under controlled conditions, will 
be differences only in the amount of that 
property. 

The property of group difficulty, as 
defined in either of the two usual ways, 
is seriously open to question in regard 
to homogeneity. At our present stage of 
knowledge, difficulty as a concept has 
a highly phenomenological character, 
blanketing all the known and unknown 
factors that stand in the way of achiev-
ing a “good” performance. Considerable
effort is made to control certain of these 
factors which most obviously participate 
in or qualify the difficulty of perform- 
ance. For example, such things as ill 
health, loud noises, and unusual emo- 
tional disturbances are excluded from 
the test situation as invalid aspects of 
difficulty. Other factors which cannot be 
excluded are presumably made equal. 
For example, care is taken to see that the 
testees are from approximately the same 
general environment and that all have 
had approximately the same opportunity 
to acquire familiarity with the material 
of the test. The directions accompany- 
ing the test are designed to mean as 
nearly as possible the same thing to all 
testees. The length of time permitted is 
held constant or made long enough to 
exercise no influence. And in order to 
equate the various purposes in taking the 
test, the examiner attempts to establish 
a rapport which will insure that every- 
one is trying as hard as he can to make 
as high a score as possible. The fact that 
these controls are not very precise is not 
viewed as a serious handicap, because it 
is assumed that the uncontrolled varia- 
tions remaining operate according to 
chance and cancel each other out for 
the total group. As long as it is the diffi-
culty of the group performance, not the
individual performance, which is being 
quantified, this assumption appears rea- 
sonable if the group is large. Further im- 
plications of this point will be discussed 
in a later section dealing with the nor-
mal curve.

In spite of these controls on the con- 
tent of difficulty, there remain further 
questions which threaten the hypothesis 
that difficulty is a homogeneous property. 
For one, how can we make a scientific 
distinction between the kinds of diffi- 
culty represented in various tests, such 
as “arithmetic difficulty” or “reading
difficulty” or “intellectual difficulty”?
Some investigators frankly make an ap- 
peal to common sense on this point and 
take popular usage as their criterion. 
Others who are more interested in get- 
ting at psychological facts appropriate to 
a science attempt to establish the distinc- 
tion by low correlation coefficients be- 
tween reputedly different kinds of tests 
and high correlation coefficients between 
tests judged to be very similar. This pro- 
cedure, of course, establishes only statisti- 
cal or logical hypotheses of difference 
and equivalence and, while it may serve 
certain evaluational purposes, does not 
in itself verify the existential difference 
or equivalence sought by a science. Fur- 
ther attempts of a very few experi- 
menters to overcome this problem will be 
considered in the following chapter. 

After all these precautions have been 
taken, the concept of difficulty is still a 
catch-all term. It is the residuum of un- 
known composition after the above fac- 
tors have been eliminated or equated. 
As a consequence, it is a bold test builder 
who feels in a position to say that there 
are no qualitative differences in “diffi-
culty” from one end of his scale to the
other. The problem is perplexing enough 
for any given group, but it becomes es- 
pecially acute when a test attempts to 
provide a continuous scale of difficulty 
through several age or grade levels. Ef- 
forts to solve the problem by correla- 
tional methods (i.e., only hypothetically) 
have not been convincing to the most 
careful investigators, and the homo- 
geneity of difficulty between age levels, 
as Boynton recently observed, has “never 
been demonstrated unequivocally to be 
true” (5, 192). 

Establishing the homogeneity of dif- 
ficulty is essential to any kind of scien- 
tific quantification of it, but the question
of which kind of quantification is ap-
propriate depends upon how difficulty 
varies in amount. Measurement requires 
that the property vary on a continuum, 
describing “how much.” On a con- 
tinuum, fractions of units as well as 
whole units can be verified as equal and 
submitted to addition and subtraction. 
Enumeration requires only that the ob- 
jects exhibiting the property in question 
be distinguishable. The property enu- 
merated may be a continuum, as length 
when inches are counted, or it may be 
dichotomous, as “apple-ness” when
apples are counted, but enumeration de- 
scribes primarily “how many.” Ranking 
requires that the property vary in terms 
of “how much,” but the ordered series 
may (and usually does) progress discon- 
tinuously from item to item, just as long 
as the principles of transitivity and asym- 
metry are observed. 

We do not yet have a definition of dif- 
ficulty which describes that property on 
a verifiable continuum. The difficulty of 
test items for the group, as computed 
by the ratio of successes to failures, in- 
creases by jumps rather than continu- 
ously (42, 307). Neither test items nor 
members of the group can be divided so 
as to give fractions of a unit increase in 
difficulty. The method of defining diffi-
culty in terms of the distribution of raw
scores merely assumes a continuum of 
difficulty corresponding to the score dis- 
tribution without providing a means of 
verifying the assumption. These facts ap- 
pear to eliminate current concepts of dif- 
ficulty from the possibility of strict meas- 
urement. Ranking and enumeration seem 
to be the appropriate methods of quan- 
tifying test results. And yet a caution 
should be repeated. Until difficulty can be 
defined as an unambiguous, isolable qual-
ity of the relationship between testee and
test item, even the possibility of scientific 
ranking of persons in this dimension ap-
pears remote.

The basic trouble is that we do not 
know enough about the nature and ex- 
tension of performance difficulty to at- 
tempt to verify experimentally its homo- 
geneity, let alone its continuity. The 
common practice so far has been too 
often to assume its homogeneity without 
offering any verifiable hypothesis which 
would scientifically test this assumption. 
As far as we really know now, difficulty 
may not be a homogeneous property 
varying only quantitatively, but may be 
the phenomenological effect of a large 
number of independent and quite dif- 
ferent factors. This will remain an open 
question until the alternative hypotheses 
have been formulated and then verified 
or falsified. For the time being, in de- 
scribing the range of a group’s perform- 
ance in terms of either concept of diffi- 
culty, we cannot say with assurance to 
what degree the range expresses differ- 
ences in amount and to what degree dif- 
ferent qualities. 

Thus we have found considerable 
grounds to doubt that difficulty, which is 
the critical property in the quantification 
of test performances, is sufficiently well 
known in nature and extent to be de- 
clared a homogeneous property for the 
kind of quantification of significance to 
science. Of course, as a practical standard 
for certain kinds of evaluation, the pres- 
ent definitions of difficulty may be en- 
tirely adequate. But considerably more 
experimentation and verification is
needed if the quantification of difficulty 
is to yield unambiguous facts and gen- 
eralizations for a science of psychology. 


TRANSITIVITY AND ASYMMETRY OF 
DIFFICULTY 


The conditions of transitivity and 
asymmetry must be satisfied in all quan-
tification, whether the process be meas-
urement, ranking, or direct quantitative 
judgment. For the rigorous purposes of 
science, the satisfaction of these condi- 
tions is achieved by employing some 
kind of physical operations and seeing 
whether the results actually satisfy the 
logic of transitivity and asymmetry. For 
example, establishing a rank order 
among various substances in regard to 
the property of hardness requires experi- 
mental proof that the hardness of these 
substances follows the conditions of tran- 
sitivity and asymmetry. Accordingly, 
when carborundum is found to scratch 
iron without being appreciably scratched 
itself, and when iron is found to scratch 
lead in the same way, the transitivity of 
the series so far is proved when carbo- 
rundum is also found to scratch lead 
without being appreciably scratched it- 
self. Asymmetry is established when no 
piece of lead can be found which will 
scratch iron without being scratched it- 
self, or scratch carborundum under the 
same conditions, and when no piece of 
iron can be found which will scratch 
carborundum. We shall find that one 
concept of difficulty satisfies these two 
conditions, but that the other (which is 
most commonly used in mental testing) 
remains only a logical hypothesis with- 
out experimental demonstration. 
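
The logic of the hardness example may be stated compactly. The following Python sketch is illustrative only: it assumes that the outcome of each scratch trial has been recorded as an ordered pair, and the names `is_asymmetric`, `is_transitive`, and `scratches` are hypothetical. It checks a recorded relation against the two conditions; it does not, of course, replace the physical operations that produce the record.

```python
def is_asymmetric(relation):
    """No pair (a, b) may occur together with its reverse (b, a)."""
    return all((b, a) not in relation for (a, b) in relation)


def is_transitive(relation):
    """Whenever a outranks b and b outranks c, a must also outrank c."""
    return all(
        (a, d) in relation
        for (a, b) in relation
        for (c, d) in relation
        if b == c
    )


# Hypothetical record of trials: (x, y) means "x scratched y without
# being appreciably scratched itself".
scratches = {
    ("carborundum", "iron"),
    ("iron", "lead"),
    ("carborundum", "lead"),
}
print(is_asymmetric(scratches), is_transitive(scratches))  # True True

# If the carborundum-lead trial had not been made, transitivity would
# remain unproved for the series.
incomplete = {("carborundum", "iron"), ("iron", "lead")}
print(is_transitive(incomplete))  # False
```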

First to be considered is that concept 
of group difficulty which is defined by 
the ratio of successes and failures by the 
group on each test item. With this defi- 
nition, difficulty may either be held con- 
stant through a series of items or be 
graduated in some systematic manner 
from item to item. 

1. When difficulty is held constant
(i.e., when only those items of equal dif-
ficulty are retained in the test), the prop-
erty of number-done-correctly is de-
scribed by simple enumeration. But
when the test items are all actually of,
say, 50 percent difficulty for the group
(with enough time allowed), the number 
done correctly by the total group will 
automatically be 50 percent or half the 
possible number which could be done 
correctly. Thus, for this group, the 
number-done-correctly is not a variable 
property and the question of transitivity 
and asymmetry would not arise. More- 
over, if the test were given to other 
groups of apparently similar compo- 
sition, the property of number-done- 
correctly of course might vary, but under 
untimed conditions this would be ex- 
cellent evidence that the level of diffi-
culty was not actually constant and
identical among the groups. 

So, to get any significant quantifica- 
tion, tests of this type are almost always 
timed so that few, if any, members of the 
group will finish. On the face of it, this 
looks as though difficulty-with-plenty-of-
time (which was constant from item to
item) has been exchanged for difficulty- 
with-limited-time. In test construction 
where this exchange is actually the case, 
the concept of difficulty-with-limited-time 
can no longer apply to the individual 
test items (since under timed conditions, 
the entire group will not attempt every 
item). Instead, the concept of difficulty- 
with-limited-time must be defined for 
the test as a whole, probably in terms of 
the distribution of raw scores. 

But in most cases, the intention is not 
to exchange these two concepts of diffi- 
culty but to make an operational distinc- 
tion between speed of performance and 
the remaining difficulty of performance. 
Presumably the original group difficulty 
persists, and the only variable is the 
speed, as indicated by the range in the 
number done correctly in the specified 
time. For the greatest assurance that the 
difficulty is held constant, it is common
practice to give a timed test over a series
of items that 90 to 100 percent of the
group could do if they had enough time.
Then enumeration becomes the legiti- 
mate method of quantifying the test per- 
formance of the group, and as a matter 
of course the conditions of transitivity 
and asymmetry are satisfied, by the logic 
of counting, in that passing ten items in 
a specified length of time is more (i.e., 
faster) than passing eight items, and pass- 
ing eight is more than passing six. 
While the procedure just described is
common practice and justifies quantifi-
cation by enumeration, we should men-
tion in passing that actual measurement 
of individual performances would be 
possible in principle under slightly dif- 
ferent conditions. If each person in the 
group could correctly answer all the test 
items provided enough time were al- 
lowed and if the length of time required 
by each person to answer all the items 
were recorded, we should have a measure 
in units of time of each person’s speed 
of performance on the test as a whole. 
Of course, in order for the time units to 
be strictly comparable between persons, 
each test item would have to be passed 
perfectly (or equally imperfectly), the 
difficulty of the test directions would 
have to be equated, and other factors af- 
fecting or composing the difficulty of the 
test would have to be similarly con- 
trolled. As indicated on an earlier page, 
this procedure of test construction and 
scoring is seldom, if ever, employed in 
current tests of mental abilities. 

2. When difficulty itself is made the 
variable dimension for quantification, 
the conditions of transitivity and asym- 
metry are satisfied by the operations of 
defining this kind of difficulty. For ex- 
ample, when the difficulty of a series 
of test items for a specified group is de- 
fined as the percentage of the group who 
pass or fail it, then item-A passed by 50
percent of the group is by operational 
definition less difficult than item-B
passed by only 40 percent of the group,
which in turn is less difficult than
item-C passed by only 39 percent of
the group. There is no later verification
involved here. If any doubt is raised 
about the stability of the transitive and 
asymmetrical series thus defined, the de- 
fining operations are simply repeated— 
i.e., the test is given again to see whether 
the percentages remain fairly constant. 
An excellent method of checking 
whether two or more groups are similar 
or representative of each other is to dis- 
cover whether the same order of transi- 
tivity for the test items holds for each of 
the groups. 
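
The check on the similarity of groups can be illustrated in a few lines. This is a sketch under stated assumptions: the figures for the first group repeat the percentages passing items A, B, and C given above, while the second and third groups, and the function name `difficulty_order`, are hypothetical.

```python
def difficulty_order(pass_percent):
    """List item labels from the most often passed to the least often passed."""
    return sorted(pass_percent, key=lambda item: pass_percent[item], reverse=True)


group_1 = {"A": 50, "B": 40, "C": 39}  # percentages from the example in the text
group_2 = {"A": 62, "B": 45, "C": 41}  # hypothetical second group
group_3 = {"A": 48, "B": 37, "C": 44}  # hypothetical third group

same_order_1_2 = difficulty_order(group_1) == difficulty_order(group_2)
same_order_1_3 = difficulty_order(group_1) == difficulty_order(group_3)
print(same_order_1_2, same_order_1_3)  # True False
```

The second group passes each item in different proportions but preserves the order A, B, C; the third reverses B and C, which would count against treating it as representative of the first.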

Finally, consideration needs to be 
given to the other kind of group diffi-
culty, defined for the test as a whole in
terms of the distribution of scores. It will 
be recalled that the range from low to 
high raw scores on a test was assumed to 
represent a roughly corresponding range 
of difficulty overcome. Since the raw 
scores themselves are obtained by adding 
the number of items answered correctly, 
the various total scores are operationally 
incapable of expressing a range of diffi-
culty. The raw score totals are therefore
converted into ordinal numbers, such as 
percentiles or quartiles. With respect to 
transitivity and asymmetry, this means 
that such raw scores as 86 and 79 and 
77 are neither transitive nor asymmetri- 
cal in regard to the amount of difficulty 
overcome. These raw scores are merely
the products of counting, and are transi- 
tive and asymmetrical to each other only 
in the sense of being more or less items- 
passed-correctly. But if it be assumed 
that higher scores tend to represent more
difficulty overcome and if the range of 
scores is converted into, say, a percentile 
rank order, then score 86 may fall in the 
sixtieth percentile and the scores 79 and
77 may fall in the fifty-eighth percentile. 
In this event, transitivity and asymmetry 
in regard to difficulty are implied be- 
tween these two percentile ranks and 
throughout the entire ordered series. 
What is the test that the conditions of 
transitivity and asymmetry are actually 
satisfied? 

There is no experimental test in this 
case, because such a test would require 
experimental demonstration that all 
scores falling in, say, the sixtieth per- 
centile always represented more group 
difficulty overcome than all scores falling 
in the fifty-ninth percentile, and so on 
throughout the entire series. But since 
group difficulty here is defined in terms 
of an assumption which yet requires 
proof as an existential fact (viz., the as- 
sumption that the distribution of diffi-
culty overcome corresponds with the dis-
tribution of raw scores), there is no ex- 
ternal standard by which to make the 
experimental demonstration that the 
conditions of transitivity and asymmetry 
are satisfied. The satisfaction of these two
conditions is merely a logical correlate 
of the original assumption, and depends 
ultimately on the verification of this as- 
sumption. Some investigators attempt to 
get around this inexorable conclusion by 
extensive statistical manipulation of the 
distribution of raw scores, but no 
amount of mathematics can prove the 
transitivity and asymmetry of the data 
with respect to group difficulty unless 
the original concept of difficulty defined 
in hypothetical terms is verified as an 
existential fact. This point will be re- 
emphasized in later sections of this chap- 
ter. 

GROUP DIFFICULTY INAPPLICABLE TO
DIFFICULTY FOR THE INDIVIDUAL


By the character of the operations per- 
formed, the two concepts of difficulty are 
scientifically applicable only when deal-
ing with the combined performances of
the total group, or some statistical reifi- 
cation like the “average person” in that 
group. Can these concepts represent at 
the same time difficulty for each indi- 
vidual in the group? Since they are not 
equivalent operationally, they must be 
proved experimentally to be equivalent. 
One of the most likely means of doing 
this is to establish that the transitivity 
and asymmetry found for group difficulty 
is also applicable to the difficulty over- 
come by each individual. 

1. In regard to that kind of difficulty 
defined by the ratio of successes and 
failures by the group on each test item, 
the transitive and asymmetrical rank or- 
der is maintained as long as similar per- 
centages of equivalent groups of persons 
succeed on each of the test items. In 
a series of items of 50 percent difficulty, 
it does not matter for group difficulty 
which persons in the group succeed or 
fail on those items just as long as 50 
percent of them succeed or fail. But in 
order to use this concept of difficulty in 
the quantification of individual perform- 
ances, the first hypothesis to be proved is 
that the 50 percent who passed or failed 
these test items were in every case the 
same persons. This gives us a two-point 
scale of difficulty—these persons cannot 
pass the test series; those persons can. 
The second step, in order to achieve 
quantification of individual perform- 
ances beyond a two-point scale, is to con- 
firm, for example, that everyone who 
passes an item of 80 percent difficulty 
also passes all other items of less diffi-
culty, and that no person who fails an
item of only 20 percent difficulty passes 
any item of higher difficulty. This step 
is simply verifying the further hypothesis 
that the conditions of transitivity and 
asymmetry established by operational
definition for the group also hold for the
individual members of that group. If 
these two hypotheses are verified, the 
concept of group difficulty is also veri- 
fied as applicable to the individuals in 
the group, and individual performances 
may thus be unambiguously quantified. 
This conclusion is entirely hypothetical, 
however, for the writer is not aware of 
any experiment with a seriously con- 
structed test where these hypotheses were 
verified. 
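
What verification of the second hypothesis would involve can be sketched as follows. The Python below is an illustration only: the data are hypothetical, the function name `violations` is an assumption, and the rule applied is simply the one stated above, namely that no testee should fail an item and yet pass another which fewer members of the group passed.

```python
def violations(responses, pass_percent):
    """Count testees whose records break the group's order of difficulty.

    `responses` holds one row of 1s (passed) and 0s (failed) per testee;
    `pass_percent` gives, item by item, the percentage of the group passing.
    """
    # Item positions ordered from the most often passed to the least.
    order = sorted(range(len(pass_percent)), key=lambda i: -pass_percent[i])
    count = 0
    for row in responses:
        ordered = [row[i] for i in order]
        # A consistent record is a run of passes followed only by failures.
        if ordered != sorted(ordered, reverse=True):
            count += 1
    return count


# Hypothetical group of five testees and three items.
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],  # fails the two easier items yet passes the hardest
]
pass_percent = [80, 60, 40]  # matches the columns of `responses`
print(violations(responses, pass_percent))  # 1
```

A count of zero over a large group would support the hypothesis for that group; a single inconsistent record, as here, is enough to show that the group order cannot be assumed to hold for every individual.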

2. The same situation exists with re- 
spect to the second kind of difficulty, de- 
fined by the distribution of raw scores, 
although the hypotheses to be verified in 
this case are somewhat different. This 
kind of difficulty refers, not to successes 
on individual items, but to the rank order
of the sums of the test items passed. Let 
us suppose, for example, that a score of 
50 items passed is the average score for 
the group and thus, by assumption, the 
point of average difficulty for the group. 
Since individual variations presumably 
cancel each other, it does not matter for 
the definition of group difficulty what 
particular test items go to make up any 
score of 50. But before it can be stated 
that two individuals making a score of 
50 overcame approximately the same
amount of difficulty, it must be shown 
that the 50 particular items passed by 
one person contained the same amount 
of difficulty as the 50 particular items 
passed by the other. Without such knowl- 
edge, the possible fact that the first per- 
son’s score of 50 includes only half the 
items represented in the second person’s 
score of 50, and vice versa, makes it very 
probable that the individual difficulty 
overcome by each person is not only 
different but so far incomparable. 
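
The case just described is easily made concrete. In the following Python sketch the item numbers are hypothetical and assume a test of at least 75 items; the point is only that identical raw scores may rest on largely different items.

```python
# Hypothetical records: the set of item numbers each person passed.
first_person = set(range(1, 51))    # items 1 through 50
second_person = set(range(26, 76))  # items 26 through 75

raw_1 = len(first_person)
raw_2 = len(second_person)
shared = first_person & second_person

print(raw_1, raw_2, len(shared))  # 50 50 25
```

Both persons receive a raw score of 50, yet only 25 of the items passed are common to the two records.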

But suppose that each person passed 
the same 50 items. May they now be com- 
pared as having overcome equal amounts 
of difficulty? To some this is a plausible
conclusion, but the logical consequences 
in establishing a similar basis of indi- 
vidual comparison throughout the distri- 
bution of raw scores made by the group 
involve so many unlikely coincidences 
that this solution does not appear prac- 
ticable. In essence, a defense of this po- 
sition requires proof not only (a) that 
persons at any given rank, such as the 
fiftieth percentile, had all passed identi- 
cal items in the number required to 
reach that rank, but also (b) that a 
score of 50 (required, let us say, to reach 
the median rank) includes the same par- 
ticular items passed as a score of 30 
(with its corresponding rank) plus 
twenty more items, and that a score of 
60 includes all of the particular items
going to make up a score of 50 plus ten 
more. In short, verification of these 
propositions requires either that raw 
scores represent the sum of particular 
items, not just any items passed, or that 
raw scores represent previously known 
increments of difficulty. According to the 
definition of this kind of group difficulty, 
the raw scores do not represent previously
known increments of difficulty. And an 
empirical inspection of the items passed 
by individuals on a mental test seldom, 
if ever, reveals that any given raw score 
includes all the particular items passed 
in the preceding, smaller raw scores. 
The fact of the matter is that even 
when two persons pass the same fifty 
items in a test, there are no cogent 
grounds for concluding that each has
overcome the same amount of those 
obstacles which go to make up psycho- 
logical “resistance” or difficulty. We know 
only that two persons successfully passed 
the same test item; we do not know that
that test item was equally difficult for
those persons to pass. Indeed, we do not
yet have an operational definition to in- 
dicate precisely what we mean by the
difficulty of test items for individuals. We 
are simply aware that many qualitatively 
different factors undoubtedly enter into 
the difficulty of test items for various 
individuals, but we have not yet reduced 
the concept of individual difficulty to 
scientific language. Thus, it is scarcely 
surprising that group definitions of diffi-
culty have been found neither convertible
nor equivalent to individual difficulty. 
Perhaps it should be made clear at 
this point that if some phase of group 
difficulty, say achieving the tenth per- 
centile, is taken as a desirable standard 
of performance for some particular pur- 
pose (such as being promoted to the next 
grade), then any scientific interest in the 
unique differences in individual per- 
formance can possibly be ignored, and 
attention centered on the immediately 
practical question of whether these per- 
sons can or cannot meet the required 
standard of performance under certain 
pertinent conditions. The difference in 
degree represented here is that which dis- 
tinguishes between the generalizations 
and laws of science and the more imme- 
diate, particular solutions to practical 
problems of evaluation. Of course, both 
science and evaluation in psychological 
matters are (or should be) genuinely ex- 
perimental, but their methodological 
techniques and operations are signifi- 
cantly distinct. We must reiterate, there- 
fore, that the scientific meaning of diffi- 
culty is set by the operations of defining 
it, and accordingly this property of test 
performance will have scientific meaning 
only as a quality of group behavior. Just 
as Boyle’s law is a scientific measure of 
the behavior of a large group of gas mole- 
cules in a confined area and not of the 
behavior of the individual molecules, so 
quantification in terms of group-difficulty 
(when this concept is sufficiently clari-
fied) is scientifically appropriate for the
group (and other groups just like it) but 
not for the quantification of individual 
performances. 


NUMERICAL BASES OF MENTAL TESTING 


Two approaches to the quantification 
of test performances have been noted, 
one being appropriate for “speed” tests 
and the other for “power” tests. For the 
speed tests, the rate of the performance 
is quantified under specific conditions. 
In almost all cases of scientific import, 
difficulty is held constant, although the 
common practice noted earlier is to es- 
tablish the constancy of difficulty (by the 
ratio of successes and failures) under un- 
timed conditions and to make it so low 
that 90-100 percent of the testees could
pass all the items if they had enough 
time. Numerically speaking, quantifica- 
tion then consists of enumerating the 
items correctly answered within the time 
limit. Since both time and the level of 
difficulty are presumably held constant, 
there is a temptation to consider the 
items passed correctly as units of per- 
formance rate. But passing test items is 
hardly a continuous process. Each item 
is a discrete job and is indivisible in the 
scoring. With a few inconsequential ex- 
ceptions, it is either passed or it is not 
passed. Hence, it is probably more pre- 
cise to term this type of quantification 
enumeration rather than a form of meas- 
urement. 

In principle, norms based upon the re- 
sults of enumeration could be established 
for this kind of testing. However, the
common practice is to seek a norm de- 
fined by the distribution of the scores of 
the group. The basis of this norm is the 
construction of a rank order of the varia- 
tion in performance rate within the 
group. To do this, the raw score totals 
are often converted into derived scores 


which employ ordinal numbers, such as 
percentiles and quartiles. For example, if 
20 percent of the group achieved scores 
up to 33 and if the scores of the last one 
percent fell between 30 and 33, then all
cardinal sums between 30 and 33 for this 
group can be represented by the ordinal 
number 20, standing for the twentieth 
percentile in this group. Differences be- 
tween ranks represent differences in the 
number of items passed within the time 
limit by equally large percentages of the 
group. Of course the differences in rank
between various percentages of the group
do not all represent equal numbers of
test items passed. However, either rank- 
ing or enumeration may be applied with 
significance to the results of speed tests.
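
The conversion of cardinal sums into ordinal ranks may be illustrated as follows. The Python sketch adopts one convention among several in use (the percentage of the group scoring at or below a given score, rounded up to a whole percentile); the function name `percentile_rank` and the ten scores are hypothetical.

```python
import math


def percentile_rank(score, group_scores):
    """Percent of the group scoring at or below `score`, rounded up."""
    at_or_below = sum(1 for s in group_scores if s <= score)
    return math.ceil(100 * at_or_below / len(group_scores))


# Hypothetical raw scores (items passed within the time limit) for ten testees.
scores = [12, 18, 21, 21, 25, 27, 30, 33, 33, 40]
ranks = {s: percentile_rank(s, scores) for s in sorted(set(scores))}
print(ranks)
# {12: 10, 18: 20, 21: 40, 25: 50, 27: 60, 30: 70, 33: 90, 40: 100}
```

In this invented group the step from the tenth to the twentieth percentile covers six items (12 to 18), while the step from the twentieth to the fortieth covers only three (18 to 21), so that equal differences in rank do not stand for equal numbers of items passed.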

1. In regard to “power” tests, where
difficulty itself is the variable, the first 
type of difficulty to examine for its nu- 
merical bases is that which defines the 
difficulty of each test item in terms of the 
percentage of successes or failures on it 
by the group. Here speed is usually un- 
important, and the quantification at- 
tempts to express the amount of difficulty 
which the “average” or “lower third” or 
“upper quarter” of the group has over- 
come in a test performance. In this case 
we do not have equal amounts of diffi- 
culty. Items A, B, and C may be 10 per- 
cent, 20 percent, and 30 percent difficulty
respectively, but the equal difference of 
10 percent in each case is only in terms of 
equal numbers of testees. Nothing is as-
serted about the amount of increment in
difficulty; we merely know that C is hard-
er than B, and B is harder than A for the 
group as a whole. 

Thus, in order to describe numerically 
the combined performance of the group 
in difficulty overcome, some system based 
on ordinal numbers must be employed. 
The most obvious approach is to arrange 
the test items in, say, an ascending order 
of difficulty. For example, if there are 100
items in the test, each item could be 
represented by an ordinal number desig- 
nating its rank in an ordered series of 
difficulty from 1 to 100. But here arises 
a fundamental difficulty. If the group of 
testees worked on the test as a combined 
unit, as gas molecules exert pressure on a 
chamber or as temperature affects a 
thermometer, we could fully expect the 
combined group to proceed up the scale 
of difficulty, passing every item up to the 
one it failed on and not passing any 
item beyond that point. But group test- 
ing does not work this way. Individual 
performances must be separately scored
and combined in order to get the group 
performance, and, as was shown in the 
discussion of individual difficulty, any 
given individual may fail at several 
points on the difficulty scale and still 
pass several items above the ordinal rank 
of his early failures. How shall these in- 
dividual performances be scored? What 
usually happens is that the correctly 
answered items are counted, but this 
clearly violates the logic of the rank order 
and thus destroys the validity of any 
scientific inference we might make from 
the original data. 

An alternative method, where test 
items are grouped at increasing levels of 
difficulty, is to cease giving credit for 
increments of difficulty passed when the 
person first fails to do one half of the 
items of any one level correctly, no mat- 
ter what single items he may be able to 
do correctly on the remaining higher 
levels of difficulty, and regardless of the 
fact that he may have failed nearly half 
the items on the preceding lower levels 
of difficulty. While this procedure is un- 
questionably arbitrary in its scoring cri- 
terion and very rough in its determina- 
tion of rank, it has the virtue of observ- 
ing the spirit of the rank order logic.
However, it is extremely laborious, and
since on empirical grounds no great dif- 
ference has been found between the rank 
order achieved by this method and the 
rank order achieved by simply counting 
the correctly answered items wherever 
they occurred—a fact which testifies to 
the very rough and approximate charac- 
ter of the first technique of ranking—it 
is common practice to ignore the logical 
principles involved and to obtain ranks 
from the scores of items passed. This pro-
cedure can probably be justified for cer-
tain practical purposes where one is not
interested in building a psychological
theory or establishing a law, but there ap- 
pears to be little prospect that a rigor- 
ously logical basis for scoring the rank 
of individual performances in difficulty 
overcome will be achieved here, even
with later refinements in testing proce- 
dures. Indeed, it may well be unachiev- 
able because of the fundamental unre- 
ality of current assumptions about the 
nature of test difficulty. 
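
The contrast between the two scoring rules just described can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the representation of a test as lists of pass/fail marks grouped by level, and the two sample records are hypothetical and are not drawn from any actual test.

```python
from typing import List

Levels = List[List[bool]]  # one inner list of pass/fail marks per difficulty level


def count_score(levels: Levels) -> int:
    """Count every item passed, wherever it occurs in the test."""
    return sum(sum(level) for level in levels)


def level_score(levels: Levels) -> int:
    """Credit levels in ascending order and stop at the first level
    on which fewer than half of the items are passed."""
    credited = 0
    for level in levels:
        if 2 * sum(level) < len(level):
            break  # later successes earn no further credit
        credited += 1
    return credited


# Hypothetical records: four levels of four items each, easiest level first.
person_a = [[True, True, True, False],
            [True, True, False, False],
            [True, False, False, False],
            [True, True, True, True]]
person_b = [[True, True, False, False],
            [True, True, False, False],
            [True, True, True, False],
            [False, False, False, False]]

print(count_score(person_a), level_score(person_a))  # 10 2
print(count_score(person_b), level_score(person_b))  # 7 3
```

With these invented records the two rules order the persons differently, which is the logical point at issue. On actual data, as noted above, the two orderings have been found to differ little.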

2. The second concept of group diffi-
culty is most frequently used when the
purpose is to quantify the performance 
of the group on a sliding scale, defined 
by the distribution of the group itself. 
This definition, it will be recalled, de-
scribes the difficulty of the test as a whole
in terms of the distribution of the raw 
scores. Rather often the test items are 
presented to the group of subjects in an 
ascending order of difficulty (first defi- 
nition) for the encouragement value, but 
the scores are found by counting the 
number of items successfully completed, 
usually without regard to their relative 
position in the test. The items are 
“equal” only in the sense that they can 
be distinguished as discrete members of 
a test. They carry an unknown amount 
of difficulty for the group. Consequently, 
the raw scores are rightly considered as 
having no meaning in themselves for the
amount of group difficulty overcome. 

On the assumption previously dis- 
cussed that the distribution of difficulty 
overcome corresponds with the distribu- 
tion of raw scores, the raw scores ex- 
pressed in cardinal numbers are con- 
verted into derived scores expressed in 
ordinal numbers. The usual procedure 
is to arrange the raw scores from the low- 
est to the highest, find the mean or 
median score for the entire group (and 
also, in most cases, the extent of devia- 
tion therefrom), and then use this new 
standard of score distribution to describe 
the amount of difficulty overcome by 
various portions of the group. For in- 
stance, a raw score of 40 might be the 
point below which 25 percent of the 
scores of the group fell, and so 40 be- 
comes the twenty-fifth percentile of diffi- 
culty or the point of division between the 
first and second quartile of difficulty for 
this standard group. Other derived 
scores expressing ordinal rank are of 
course possible. This raw score of 40 
might represent the average difficulty 
overcome by sixth graders in Middle- 
town or the thirtieth percentile for 
twelve-year-olds. In each case, however, 
the score gets its meaning as difficulty 
overcome from a particular group and 
from the assumption that the difficulty 
overcome varies approximately as the 
raw scores (66, 342). 
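
The dependence of a derived score on the particular standard group can be illustrated with a short Python sketch. The function name and both sets of scores are invented for the illustration; only the idea of the "point below which a given percent of the scores fall" comes from the text.

```python
from bisect import bisect_left
from typing import Sequence


def percent_below(raw_score: float, group_scores: Sequence[float]) -> float:
    """Percentage of the reference group scoring strictly below raw_score."""
    ordered = sorted(group_scores)
    return 100.0 * bisect_left(ordered, raw_score) / len(ordered)


# Hypothetical reference groups (invented scores).
group_one = [30, 36, 40, 41, 43, 45, 48, 52]
group_two = [20, 25, 28, 33, 37, 40, 44, 46, 50, 55]

print(percent_below(40, group_one))  # 25.0
print(percent_below(40, group_two))  # 50.0
```

The same raw score of 40 marks the twenty-fifth percentile in one invented group and the fiftieth in the other, so the derived score carries no meaning apart from the group that defines it.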

If the basic assumption of this defini- 
tion of difficulty could be verified, there 
could be no question of the practice of 
ranking the performance of groups, or 
portions of groups, in terms of difficulty 
overcome. However, many investigators, 
not content with the possibility of rank- 
ing, have sought to achieve the formal re- 
quirements of measurement by establish- 
ing equal increments of this kind of dif- 
ficulty. Apparently not all of these have 
been interested in contributions to 
science, since their procedures are based 
on the following argument: There is 
some evidence to show that when a test 
contains a very large number of items, 
the increments of difficulty between adja- 
cent items are so small as to approach 
equality, and “the resulting series of 
scores may thus approach a status not 
unlike that in which cardinal numbers 
apply and the scores may be treated as 
if they actually were cardinal numbers. 
This is what the mental testers have 
done, usually without being at all both- 
ered about the logic of the situation” 
(16, 4). Obviously this is pseudo-measure- 
ment and pseudo-science. 

A more reputable attempt to achieve 
equal units of difficulty has been through 
the use of the normal curve. Its use is 
probably the chief support to the claim, 
when the claim is actually made, that 
tests are or can be measuring devices. 
Thorndike, who has done some of the 
most exacting work with mental tests, 
attaches the following significance to the 
distribution curve of raw scores: 


“Knowledge of the differences in difficulty 
... may be had if (a) the form of distribution 
of the varying conditions of the ability in an 
individual is known; or if (b) the form of 
distribution of the varying abilities of the 
individuals in a group is known. ... Whatever
distribution is approximated by the average 
of these distributions, and more closely when 
the scores for two or more are averaged than 
when they are used singly, has a strong prob- 
ability of being near to the form of distribu- 
tion which the altitude of that ability would 
show if measured in truly equal units” (55, 
482-83). 


His method of proving that the distri- 
bution in each case is normal will be 
considered in the following chapter. For 
the present only the significance of the 
normal curve for measurement will be 
considered. 


THE NORMAL CURVE


The normal curve is a mathematical 
concept. As such, it is a hypothetical 
form available for use with any data that 
fulfill its logical requirements. Like all 
other mathematical concepts, the normal 
curve is essentially independent of any 
particular material which may compose 
its content. In practice, however, the 
name of the normal curve has been 
qualified in three different ways, each 
arising from the kind of existential data 
to which the curve has been applied 
(15, 83-84). It is sometimes called the 
normal probability curve because it gives 
the theoretical probability of the occur- 
rence of phenomena appearing by 
chance. For example, the empirical re- 
sults in “heads” and “tails” of tossing 
ten coins over an extended period come 
very close to the theoretical expectancy 
given by the curve. It is also called the 
normal frequency curve because the oc- 
currence of many measured facts in na- 
ture often assumes this distribution. For 
a standard example, when the measured 
heights of a cross-section sample of Brit- 
ish males were plotted for frequency, the 
normal curve was approximated. It 
should be noted that both of these ex- 
amples record frequency, but that the 
first was in terms of the appearance of a 
discrete event and the second in terms 
of variations in a continuous measured
property exhibited by members of the 
group. Finally it is sometimes called the 
normal curve of error because actual re- 
peated measurements of the same phe- 
nomenon tend to diverge from the “true” 
measure according to this form. Psycho- 
physics has found it also applies to errors 
in judgment of the weight, length, or in-
tensity of a phenomenon.

A point of crucial significance should 
be established at once. The normal curve
as a curve of error applies only to varia- 
tions expressed in the previously estab- 
lished equal units of measurement. The 
normal frequency curve and the normal 
probability curve may apply to measured 
data if the results of measurement ap- 
proximate a close fit, but these curves 
are also applicable to frequencies of non- 
measurable events. As an example of the 
latter case, coin-tossing is not a measur- 
able phenomenon. It is simple counting 
of the frequency of occurrence of discrete 
events. Thus, the use of the normal curve 
is not restricted to measurable data. 

But let us look further. What of the 
units derived from the properties of the 
normal curve, such as sigma units? Are 
not these capable of expressing in equal 
units all data that fit the normal curve? 
These questions require an examination 
of the construction of the normal curve. 

The plotting of the normal curve re- 
quires two axes. The perpendicular or 
y-axis is laid off in equal steps represent- 
ing an accumulation of frequencies. 
This, as we have just observed, is not 
measurement but counting. The steps 
are made equal as a matter of mathe- 
matical necessity, but the actual distance 
between steps is a matter of arbitrary 
choice. The horizontal or x-axis must 
also be laid off in equal steps as a matter 
of mathematical necessity. If the data are 
the results of coin-tossing, the possible 
numbers of “heads” occurring on any 
throw become the equal steps. Be it 
noted that these are not equal steps of 
anything, as in measurement, but simply 
constitute a graphic convention necessary 
to the plotting of the curve. On the other 
hand, if the data are the results of meas- 
urement, say of the height of British 
males or of repeated measures of the 
length of a table, the equal steps on the 
x-axis are actually equal units of meas- 
urement: inches. They are equal steps 
of something: the height of men or the 
length of a table. 

Now what of sigma units or probable
error units derived from the normal 
curve? In the case of coin-tossing, the 
appearance of four heads may be one 
sigma away from the appearance of five 
heads, assuming the appearance of five 
heads to be the median. Mathematically 
speaking, one sigma is equal to any other 
sigma under the normal curve. In the 
above example, the first sigma unit from 
the mean is the mathematical or hypo- 
thetical “distance,” represented on the 
base line as lying between four heads and 
five heads, in which roughly one third of 
the cases fall. Such information is un- 
deniably useful, but we must be quite 
clear that the word “distance” is essen- 
tially hypothetical and has no existential 
reference whatsoever. 

Such is not the case with the curves 
depicting the measurement of height or 
of length. Here the base line is laid out 
in inches. Let us assume again that the 
occurrence of the measure four inches 
is one sigma away from the occurrence 
of the measure five inches, the mean. 
Then it will follow that the occurrence 
of the point, three inches, is two sigmas 
from the mean point, five inches. As in 
the previous case, the sigma units in 
themselves still indicate the base line 
“distances” above which certain percent- 
ages of the total cases fall, but in the 
present instance these “distances” can be 
translated into actual amounts. We can
say that approximately two thirds of the 
cases fall within one inch of either side 
of the mean, and that around ninety-five 
percent of the cases fall within two 
inches of either side of the mean. 
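
The two fractions just quoted follow from the form of the normal curve and can be checked with Python's standard library. The sketch below takes the illustrative values of the text (a mean of five inches and one sigma equal to one inch) as assumptions; the helper name is hypothetical.

```python
from statistics import NormalDist

# Assumed illustration from the text: mean of five inches, one sigma equal to one inch.
lengths = NormalDist(mu=5.0, sigma=1.0)


def fraction_within(dist: NormalDist, k: float) -> float:
    """Fraction of cases lying within k sigmas on either side of the mean."""
    upper = dist.cdf(dist.mean + k * dist.stdev)
    lower = dist.cdf(dist.mean - k * dist.stdev)
    return upper - lower


print(round(fraction_within(lengths, 1), 4))  # 0.6827
print(round(fraction_within(lengths, 2), 4))  # 0.9545
```

The fractions themselves are properties of the curve alone. They can be read as "within one inch" or "within two inches" only because the base line was already laid out in inches before the curve was fitted.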

Why are such statements possible 
about measured data and not about such 
discrete events as coin-tossing? There is 
only one answer. The first data were 
measured previously to the fitting of the 
curve, and were thus known to vary on 
a linear scale with respect to the quan- 
tity of some property. The second data 
were not previously measured (in fact,
were not measurable) and their assign-
ment to a scale position on the x-axis
was arbitrary and had no existential 
reference. 

Independently of the several kinds of 
data which may fit the normal curve, 
the precise form of the curve is derived 
from the logic of probability. The logical 
conditions of probability are worth ex- 
amining, because no probability predic- 
tions for given data are warranted unless 
those data closely fulfill these logical 
conditions. The formal nature of proba- 
bility may be defined as follows: 


“If an event can happen in a certain num- 
ber of distinguishable ways, and if some of 
the ways be regarded as favorable, then the 
ratio of the number of favorable ways to the 
total number of ways is called the probability 
of the event occurring favorably, PROVIDED 
the total number of ways of occurrence be re- 
garded as equally likely” (16, 72). 


In terms of this definition, the first 
logical condition which the data must 
fulfill is that the ways in which the event 
can occur must be known—i.e., con- 
trolled. For example, in tossing coins, 
conditions are so arranged that only 
“heads” or “tails” will occur, or in meas- 
uring the height of men, only differences 
in inches will occur. 

A second logical condition is contained 
in the provision that the occurrence of 
the events be regarded as equally likely. 
In the case of tossing several coins, this 
means that there should be no apparent 
reason why any one coin should fall 
“heads” oftener than “tails.” In the case 
of plotting the height of men, this means 
that the many factors that help deter- 
mine the height of men should be 
equally likely to operate. A common way 
of summarizing this logical condition is 
to say that the events must occur by
chance. By chance we do not mean ir- 
rational behavior or absence of any de- 
terminism. Instead a very complex type 
of determinism is implied. “ ‘Chance’ 
may be defined as the result obtained 
from the operation of a great many fac- 
tors, none of which is dominant, or, put in 
another way, all of which are (relatively) 
similar, equal, and independent” (15, 82). 

Mathematically speaking, the normal 
curve may be defined as the expansion 
of the binomial theorem. Empirically
speaking, it is closely fulfilled when data 
fit the above conditions of homogeneity 
and chance. Consequently, the following 
constitutes an empirical definition of the 
normal curve: “The ‘ideal’ polygon or 
normal curve, therefore, may be said to 
represent the relative frequency of oc- 
currence of various combinations of a 
very large number of equal, similar, and 

independent factors, when the chances of
the occurrence or non-occurrence of each 
factor is the same” (15, 81). 
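
The relation between the binomial expansion and the normal curve can be seen numerically in the ten-coin example used earlier. The following Python sketch is an illustration only: the function names are hypothetical, and the coins are assumed to be fair and independent.

```python
import math


def binomial_probability(n: int, k: int, p: float = 0.5) -> float:
    """Probability of exactly k heads among n independent coins."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def normal_density(x: float, mean: float, sd: float) -> float:
    """Height of the normal curve with the given mean and sigma at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


n, p = 10, 0.5                    # ten fair coins (assumed)
mean = n * p                      # five heads
sd = math.sqrt(n * p * (1 - p))   # sigma of the number of heads

for heads in range(n + 1):
    exact = binomial_probability(n, heads, p)
    curve = normal_density(heads, mean, sd)
    print(heads, round(exact, 4), round(curve, 4))
```

Even with only ten coins the two columns agree closely. Under these assumptions sigma comes to about 1.58 heads, so statements of the form "four heads is one sigma from five heads" should be read as rough illustrations rather than exact values.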

Three aspects of this definition are 
basic to the scientific use of the normal 
curve concept: (1) A group of events 
may be expected to occur according to 
the normal curve only if the possible 
factors producing these events are very 
large in number and equally free to op-
erate. (2) The events must be homoge-
neous with respect to some common at-
tribute, such as “coming heads,” or “be-
ing white,” or “length in inches.” The
homogeneity of the events or phe- 
nomena is a matter to be determined 
empirically before the curve is fitted, and
not, as Boring (4, 30) warns, a matter 
to be deduced from the fact that the 
curve happens to fit some phenomena. 
Moreover, as we saw earlier, the common 
attribute need not be a measurable prop- 
erty, but it must at least be distinguish- 
able as homogeneous. (3) The common 
attribute of the events or phenomena 
must vary only on a quantitative linear 
scale of pre-determined units. In the 
case of coin-tossing, the pre-determined 
units (i.e., “heads” and “tails”) are dis- 
crete items for enumeration but are not 
measures of amount. But even though 
they do not represent continuous incre- 
ments of amount on the base line of the 
curve, they are given a hypothetical or 
mathematical equality for purposes of 
plotting the curve. However, in the case 
of variations in the height of British 
males, the units are genuinely existential 
and their equality is predetermined by 
the actual measuring process. 


NORMAL-CURVE UNITS OF DIFFICULTY 


It was shown earlier that variations in 
the amount of group difficulty overcome 
could be ranked by derived scores on the 
assumption that the distribution of 
group difficulty varied as the distribution 
of the raw scores varied. For the purposes 
of ranking by percentiles or quartiles, no 
particular attention was paid to the 
form of the distribution. When the in- 
tent is to use the normal curve, it be- 
comes very important that the frequency 
distribution of the raw scores made by 
the group closely approximate a normal 
curve distribution. This is relatively 
easy to do, for it simply involves satisfy- 
ing the logical conditions of a normal 
frequency distribution—i.e., making the 
achievement of the raw scores depend 
upon a very large number of equal, simi-
lar, and independent factors. This is
usually accomplished by using a large, 
unselected group of testees. The process 
of constructing tests so as to insure the 
likelihood of a normal distribution of 
raw scores from a specified group is often 
called test standardization (14, 354). 

It should be said in passing that the 
attempt to obtain a normal distribution 
of raw scores is much more than a matter 
of convenience to many test builders. 
They believe that variations in the
amount of ability in members of the 
group are normally distributed, an as- 
sumption which is equivalent in practice 
to saying that the variations in amount 
of difficulty overcome by the group are
normally distributed (66, 346). The dis- 
tribution of difficulty, as we have seen, 
has already been assumed to vary as the 
distribution of raw scores, so, if both of 
these assumptions are true, a test which 
is built to elicit a normal distribution of 
raw scores is by logical assumption the 
truest index of the ability variations in 
the group. Although the assumption that 
variations in the “true” ability of the 
group are normally distributed will be 
considered in detail in the next chapter,
it will suffice to say here that this assump- 
tion has not been verified as yet. As a 
consequence, the choice of a normal dis- 
tribution of raw scores (and, as a corol- 
lary, a normal distribution of group dif- 
ficulty overcome) is not forced upon the 
investigator by the scientific nature of 
his data. By altering the test items, 
other distributions of difficulty could be 
obtained. 

After a mental test has been standard- 
ized—i.e., constructed so that the distri- 
bution frequency of the raw scores of a 
specified group will be approximately 
normal—the next question concerns 
which kind of a normal curve is repre- 
sented by the raw scores. Is it a frequency 
curve of discrete events, such as the ap- 
pearance of “heads” in coin-tossing, thus 
indicating the probability of re-occur- 
rence of those frequencies? Or is it a fre- 
quency curve of the measurements of 
many objects in a certain class, such as 
the height of men? Or is it the curve of 
error, representing the central tendency 
of many measurements of the same 
thing? These questions are answered by 
inspection of the actual data making up 
the curve, and the answer is important 
because it determines the existential 
meaning to be ascribed to derived scores, 
like sigma units, which are based on the 
properties of the normal curve. 

When a test of the variation in per- 
formance rate of the group is “standard- 
ized,” the resulting normal distribution 
of scores represents the frequency curve 
of enumerated data. Here the divisions 
along the base line are the number of 
items done correctly within a specified 
time limit, difficulty presumably being 
held constant at a very low figure. Sigma 
scores, besides being divisions of group 
dispersion, would represent “distances” 
along the base line embracing equal 
numbers of equally difficult items done 
correctly within the time limit. These 
“distances” refer existentially to “how 
many” of a number of similar, discrete 
test items—not, in this case, to “how 
much” of some continuous property. 
Thus, the sigma scores simply present a 
different cardinal series for enumerating 
the items passed—something like count- 
ing by 5’s rather than by 1’s. This con- 
version is possible because the original 
raw scores plotted along the base line 
of the normal curve constituted a cardi- 
nal series of equally difficult test items. 
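
That sigma scores of this kind are only a re-expression of the original count can be shown with a brief Python sketch. The counts are invented, and the use of the population standard deviation is an assumption of the sketch.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def sigma_scores(counts: Sequence[float]) -> List[float]:
    """Express each count as a deviation from the group mean in sigma units."""
    centre = mean(counts)
    spread = pstdev(counts)  # population standard deviation (assumed convention)
    return [(count - centre) / spread for count in counts]


# Hypothetical numbers of equally easy items completed within the time limit.
counts = [20, 25, 30, 35, 40, 45, 50]
print(sigma_scores(counts))  # [-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5]
```

Every step of five items becomes a step of half a sigma, wherever on the scale it occurs. The conversion is a change of counting unit and nothing more, which is legitimate here only because the items counted were equally difficult to begin with.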

When a test, designed to indicate vari-
ations in the difficulty overcome by the
group and based on the assumption that
this difficulty varies as the distribution
of the raw scores, is standardized, the
resulting normal distribution of the fre-
quency of occurrence of certain discrete
events is significantly different. In the example
above, the discrete events were various 
totals of equally difficult test items. In 
this example, the discrete events are 
various totals of only similar-looking test 
items. The probable variations in diffi-
culty of the test items are unknown. The
test items satisfy the conditions prerequi- 
site to counting, not because they are 
“equally difficult” or “equal increments” 
of difficulty, but only because they are
“equally distinguishable” from each 
other as separate test items. 

At this point the assumption that the 
distribution of raw scores corresponds to 
the distribution of group difficulty over- 
come is introduced. The higher the score 
total, the greater the difficulty overcome. 
In terms of this assumption, the base line 
of the normal curve is now viewed as a 
continuum of increasing difficulty. Ad- 
mittedly the raw scores do not represent 
equal increments of difficulty—in fact, 
we are asked to assume that they even 
represent unequal increments of diffi-
culty. So now another assumption is nec-
essary: that in a large, unselected group
of persons the factors which constitute 
test difficulty are probably large in num- 
ber and equally free to operate so that 
the difficulty experienced by this group 
should be normally distributed among 
its members. This condition will be re- 
flected in the normal distribution of the 
raw scores of the group, since the as- 
sumption has already been made that 
the distribution in difficulty overcome 
corresponds to the distribution of the 
raw scores. In short, a normal distribu-
tion of raw scores (and hence of diffi-
culty) should be the most accurate pic-
ture of the difficulty actually overcome.

If one can accept the logic up to this
point, the usual conclusions drawn are 
unescapable. The raw scores are plotted 
along the base line (the assumed con- 
tinuum of difficulty) as though the dif- 
ference in difficulty between scores 46 
and 48 is the same as the difference be- 
tween scores 59 and 61. One does not 
have to argue that these pretended equal 
differences are actually true, for if the
resulting curve is not quite normal, the
numerical value of the raw scores can 
be readjusted to make a perfectly normal 
distribution. Normality is the test of an 
accurate distribution of difficulty, and 
sigma scores become unquestionably the 
units describing equal increments of dif- 
ficulty for that group. Here, ostensibly, 
is the achievement of the measurement 
of difficulty, replete with a continuum 
and equal units. 

The entire argument, of course, is cir- 
cular and self-deluding. First, the act of 
adjusting or reweighting the raw scores 
for a more nearly perfect fit to the nor- 
mal curve removes all possibility of ever 
verifying the primary assumption that 
test difficulty (or the ability to over- 
come it) is really normally distributed 
among an unselected population. This 
amounts to guessing at the form of dis-
tribution and then correcting the origi-
nal units to fit. It would be just as “scien-
tific” (really unscientific) to guess some
other distribution. One cannot advance 
scientific inquiry by forcing the normal 
curve on a distribution of unknown 
form, especially when one is uncertain 
about what is being distributed. More- 
over, the above argument from the as- 
sumption that the true distribution 
should be normal leads straight back to 
that assumption, unbroken by any criti- 
cal act of experimental verification. 
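
The circularity can be made concrete with a small Python sketch. The text does not specify how the raw scores are readjusted, so the sketch assumes one common device, conversion of each score to the normal deviate of its mid-rank percentile. The function name and the scores are hypothetical.

```python
from statistics import NormalDist
from typing import List, Sequence


def normalized_sigma_scores(raw_scores: Sequence[float]) -> List[float]:
    """Replace each score by the unit-normal deviate of its mid-rank percentile.
    This is an assumed normalizing procedure, not one taken from the text."""
    unit_normal = NormalDist()
    total = len(raw_scores)
    result = []
    for score in raw_scores:
        below = sum(1 for other in raw_scores if other < score)
        tied = sum(1 for other in raw_scores if other == score)
        percentile = (below + 0.5 * tied) / total  # always strictly between 0 and 1
        result.append(unit_normal.inv_cdf(percentile))
    return result


# Invented raw scores, and the same scores stretched by an arbitrary
# order-preserving change of units (here, cubing).
raw = [3, 7, 8, 12, 20, 21, 40]
stretched = [score**3 for score in raw]

print(normalized_sigma_scores(raw) == normalized_sigma_scores(stretched))  # True
```

The normalized scores depend only on the order of the raw scores. Whatever the real spacing of difficulty between scores may be, the procedure returns the same "equal units," so its output cannot serve as evidence about that spacing.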

In the second place, the assumption 
that the distribution of difficulty corres- 
ponds to the distribution of the raw 
scores also never receives any critical 
verification in the above reasoning. The 
original data were a set of scores derived 
from totaling for each person the num- 
ber of similar-looking items of unknown 
difficulty which he successfully passed. 
The frequency of the totals may be plot-
ted on a graph, but the intervals be-
tween scores are made arbitrarily equal
for the sake of plotting and drawing a 
curve. Thus these intervals have only 
a hypothetical equality—they refer to no 
existential units of some property like 
difficulty unless they are simply assumed 
to do so. Moreover, if the distribution of 
frequencies happens to be normal, the
reference of the sigma units to a distance 
along the base line is only a logical or 
mathematical reference, appropriate to 
describing the variability in a group per- 
formance but not appropriate to estab- 
lishing equal units of the amount of 
difficulty overcome. 

The simple conclusion is that in no 
case can the normal curve provide equal 
units of any existing property unless that 
property is first describable in equal 
units independently of its frequency 
curve on a graph. The attempt to use 
the assumption of normal distribution as 
the criterion for existential facts is a 
patent case of the old philosophical
habit of trying to get factual knowledge 
out of sheer logic. We have seen how 
it ignores the scientific requirement that 
equal units are established, not by defi- 
nition, but by experimental operations 
of adding the property in question and
empirically testing the results. But this 
assumption is also used in other ways 
inappropriate to a science. Some investi- 
gators attempt to establish the homo- 
geneity of a desired kind of difficulty by 
the test of whether it gives a normal dis-
tribution of raw scores. The same as-
sumption is used as grounds for elimi-
nating certain deviations in performance
as “chance errors.” The scientific prin- 
ciple violated in each case is that data 
should not be forced to fit a preferred 
logical form, but that logical forms 
should be developed which will accu- 
rately fit the data. 

In brief summary, this chapter has 
been concerned with identifying the 
types of quantification employed by the 
kind of mental tests which seem to hold 
significance for a science of psychology. 
The critical property for quantification 
was found in all cases to be difficulty, 
variously defined. The usual definitions 
of difficulty do not appear sufficiently 
isolable as specific, unambiguous com- 
ponents of a test performance to become 
the subject matter of a basic science. 
There are still too many grounds for 
doubt that difficulty is a homogeneous 
property varying only quantitatively be- 
tween groups and between age levels. In 
terms of current definitions, difficulty 
seems to be an intermittently dichoto- 
mous quality rather than a continuum. 
Present control of the ill-defined factors 
composing difficulty is based largely on 
the canceling effect of chance variations 
in these factors, a control which is per- 
haps effective for large groups but not 
adequate for scientific comparisons be- 
tween individuals. Virtually all mental 
tests define difficulty in terms of the 
group instead of in terms of individuals. 
Current concepts of difficulty are not ap- 
plicable, either by definition or by ex- 
perimental verification, to the difficulty 
encountered and overcome by various 
individuals in their test scores. 

Two types of quantification by mental 
tests were found—enumeration and rank- 
ing. In tests of limited time over items 
of constantly low difficulty, both enumer-
ation and ranking are appropriate means
of quantifying the results. In power tests 
built exclusively from items of deter- 
mined difficulty, the only logical means 
of quantification is ranking, but it is 
still so rough and arbitrary that its re- 
sults are indistinguishable from those 
achieved by non-logical enumeration. In
power tests which rest on the assumption 
that the range of difficulty corresponds 
to the range of scores, ranking is the only 
conceivable method of quantification for 
scientific purposes. The use of the nor- 
mal curve to derive equal units of dif- 
ficulty is based on circular, fallacious, and 
unscientific reasoning and holds out no 
prospects of success for future study. 


CHAPTER V 


MENTAL TESTS AND PSYCHOLOGICAL THEORIES


IN THE analysis of the preceding chap-
ter, tests made a rather poor showing
as scientific instruments of quantifica- 
tion. Speed tests appeared to fulfill ade- 
quately the technical requirements of 
enumeration and of ranking, but their 
potential scientific significance was viti-
ated by the present ambiguity and re-
stricted applicability of the difficulty
concept. Power tests barely approxi- 
mated the requirements of ranking and 
gave little promise of sufficient future 
refinement, by the nature of the assump- 
tions involved. Neither type held any 
immediate hope for the treatment re- 
quired in establishing laws and building 
theories. The conditions appropriate to 
measurement in the technical sense were 
not met by any of the types of test con-
struction considered. The prospect for a
scientific quantification of human abili- 
ties through current types of mental tests 
appears dim at this point. 

But there are several other uses of 
mental tests, some of which have scien- 
tific purposes and others of which have 
the practical purposes of evaluation. In 
matters calling for particular evaluations, 
the usefulness of many current tests is
well recognized. They are valuable engi- 
neering devices to help meet the practi- 
cal problems of the school and of the 
employer, and fully deserve the attention 
they are receiving toward improving 
these services. On the scientific side, in 
spite of their apparently limited signifi- 
cance as instruments of quantification, 
mental tests may still contribute to a
science of psychology by producing other 
than measured results. Not all scientific 
laws and generalizations are expressible
in terms of quantities. Some laws, such 
as the classificatory generalizations of bi-
ology and botany, express a constant but
non-quantitative relationship. Others, 
such as those in qualitative analysis in 
chemistry, provide a critical test of the 
presence or absence of some primary 
element. 

There is considerable recognition of 
this possible role for mental tests. One 
writer concedes “that everyone knows 
that mental tests do not really measure 
mentality in the sense that meter sticks 
measure length or that thermometers 
measure temperature,” and then re- 
minds us pointedly of the distinction be- 
tween “test” and “measurement” in the 
physical sciences: “A test is a critical 
experiment or operation that determines 
a relationship; a measurement is an op-
eration that determines an amount” (27,
200). Perhaps this meaning may become 
the appropriate one for those mental 
tests which venture to establish psycho- 
logical laws or generalizations. From 
another source we have this statement: 
“The essential characteristic of a mental 
test is not that it affords a measurement, 
a numerical description, but that, as we 
have reiterated, it is a situation carefully
selected or designed with a view to elicit- 
ing responses as free as possible from 
ambiguity in their psychological indica- 
tion” (19, 54). One of the leading test 
experimenters at the present observes 
that a mental test “is a sampling device 
rather than a measuring device in the 
proper sense of the term, and ... its
validity therefore depends upon the prin-
ciples of sampling” (35, Pt. I, 376).

While it may be admitted that experi- 
mental psychology has established few, 
if any, psychological laws as yet, it is 
more important to be on the right track, 
to learn how to ask the questions that 
can receive scientific answers. Of special 
concern to this discussion is the question: 
Can mental tests, however their function 
may be defined, be so used that their 
results will serve as sound bases for psy- 
chological laws and generalizations? In 
answering this question, we shall not be 
concerned with mental tests which are 
frankly devoted to particular evalua- 
tional problems, such as classifying the 
efficiency of individual performances in 
some area in terms of a value standard 
defined by the average performance in 
the group. The best tests of this type, 
such as Terman’s revision of the Stan- 
ford-Binet (50), are genuinely experi- 
mental in construction but they usually 
lay but little claim to scientific conclu- 
sions. 

By scientific conclusions, we mean 
those existential propositions concerning 
human behavior which, by being essen- 
tially dependent only on the human pur- 
pose of universal objectivity, are largely 
free from more variable purposes and 
are consequently available for realizing 
a great number of purposes. Rather than 
being particulars, they are generaliza- 
tions circumscribed by highly repeatable 
conditions. For example, in the science 
of psychology, answers to the following 
questions would be characteristically sci- 
entific conclusions: What common fac- 
tors will explain why persons differ in 
their ability to do certain things? To 
what extent are these factors organic or 
native and to what extent are they social 
or acquired? Can native intelligence be 
isolated and operationally defined? What 
universal relationships in human behavior
will give control and prediction of
changes in individual behavior under 
critically controlled social conditions? 

If mental tests are to make contributions
to such questions as these, the testing
procedure may be expected to take
account of careful operational defini- 
tions, the development of a coherent 
theory, the formulation of verifiable hy- 
potheses, and their verification under 
controlled conditions. To the extent that 
current tests are directly intended to 
achieve basic scientific results, we shall 
not only examine their present efficiency 
but attempt to identify what conditions 
must be satisfied for their future improve- 
ment as tools of science. 


DEFINITIONS OF ABILITY 


In common parlance the term “abil- 
ity” is used in several senses. These mean- 
ings may be conveniently classified under 
three concepts: organic ability, social 
ability, and behavioristic or test ability.*
Of course, not all definitions of ability 
under these headings have scientific sig- 
nificance, but definitions of each type 
have entered into the construction of 
tests designed for scientific purposes. 

1. The organic concept of ability re- 
fers to the internal, presumably neuro- 
logical capacity of a person to act in 
certain ways. It gets its meaning through 
the separation of what a person does 
(performance) from how he is able to do 
it (ability). Consequently, a person's 
ability is not accessible directly (at least, 
as yet) but only indirectly through his 
performance. This concept of ability has 
been in rather frequent use among men- 
tal testers. For example, Monroe says: 
“Briefly, the measurement of an ability 
consists of securing a quantitative de- 
scription of the performance which that 


* Suggested by F. N. Freeman (35, Pt. I, 12-14). 


46 LAWRENCE G. THOMAS 


ability produces under specified and con- 
trolled conditions” (31, 20). According 
to Eurich and Carroll: “It is impossible 
to measure an individual's ability di- 
rectly”; and then, “The performance of 
an individual represents the ability, apti- 
tude, or accomplishment that is being 
measured” (12, 66). In other words: “The
educational or mental test measures only
the performance of the individual. . . .
The actual ability, or general power to 
do, can only be inferred from the per- 
formance of the individual on the test” 
(67, 354). 

The organic concept of ability, when 
used in connection with mental tests, 
thus becomes a hypothetical construct 
which generalizes into a common homo- 
geneous factor the known and unknown 
causes of the differences in the perform- 
ance of persons in any given set of tasks. 
We have good reason to believe that a 
person’s motives, enthusiasm, past ex- 
perience, state of health, inhibitions, 
complexes, glandular secretions, neural 
connections, and the like, are associated 
in some causal capacity with the per- 
formance he produces, but at least some 
of these factors are assumed to unite in 
composing a relatively stable, persisting 
quality which may be designated as an 
ability. The assumption of a stable, per- 
sisting quality to the ability is necessary 
in order to justify repeated applications 
of the same or a similar test for the pur- 
pose of eliminating “chance” factors. A 
further common assumption regarding 
the nature of this concept of ability is 
that the amount of ability varies directly 
as the difficulty overcome in the per- 
formance, an assumption which is sub- 
ject to all the qualifications of difficulty 
discussed in the preceding chapter. 

A critical point at issue when mental 
testing uses the organic concept of ability,
especially in regard to intelligence
testing, is whether the ability producing
the tested performance is very largely 
native or the combined product of na- 
ture and nurture. All experimenters 
would probably agree that the effective 
ability producing the performance in 
question is unescapably a product of na- 
ture and nurture. For instance, Thorn- 
dike’s very careful definition of the in- 
telligence producing performances on 
his CAVD Test includes “not only the 
native, inherent capacity which a person 
has for such successes, but also whatever 
education has added thereto, and what- 
ever increment of success with intellec- 
tual tasks he has by virtue of working 
with better intellectual tools” (55, 95). 
However, the express intent of many ex- 
perimenters has been to test ultimately 
an intelligence “which is general rather 
than specific, a capacity which is native 
or inborn rather than acquired through 
education” (26, 21; 5, 19). For that rea- 
son, considerable effort has occasionally 
been expended to equalize the effects of 
nurture on the testees so that the remaining
differences between them could logically
be attributed to differences in native
endowment.

2. The social concept of ability is usu- 
ally some broad formulation based on 
common observations of how effectively 
various people deal with particular prob- 
lems in a particular culture. The social 
concept and the organic concept are not 
mutually exclusive but neither are they 
equivalent. Social concepts of ability are 
seldom sufficiently precise and unam- 
biguous to be used in scientific inquiry, 
but they may and frequently do stimu- 
late such inquiry. For example, Francis 
Galton’s social concept of a genius as 
one who attained an intellectual rank 
held by only one in a million, and of an 
eminent man as one “who reached the 
position attained by one person in 
4,000,” (63, 45) is the forerunner of the
current statistical concept of intelligence 
groupings. And Thorndike began his 
painstaking definition of CAVD Intellect 
from the premise of the generally recog- 
nized difference between “idiots and Aris- 
totles” (55, 63-64). 

While social concepts of ability may 
lead to scientific investigations, they are 
not the criteria of validity for scientific 
results. As we noted in Chapter II, 
science has slowly developed its own 
criteria for the validity of its existen- 
tial propositions—operational definitions, 
critical verification, and so on. Experi- 
mental studies, whether intended to be 
scientific or evaluational, which actually 
depend on correlation with selected so- 
cial concepts of an ability (e.g., school 
marks) for the validity of their results 
have thus limited their validity to that 
of the particular social concept selected. 

3. The behavioristic or test concept of 
ability represents a direct attempt to em- 
ploy unequivocal physical operations in 
the definition of a concept. Under this 
concept no separation is made between 
ability and performance. Rather the abil- 
ity is a functional relationship between 
a purposing individual and a proble- 
matical situation (i.e., the test). The sig- 
nificant feature of this concept is that it 
avoids the dualism between ability and 
performance which characterized the or- 
ganic concept. As will be seen later, how- 
ever, many of the technical problems in 
the experimental control of conditions 
are common to both concepts. Thus, the 
chief problem with the organic concept 
in mental testing is so to purify the per- 
formance that it represents nothing but 
the ability in question, while the chief 
problem with the behavioristic concept is 
so to control the test conditions that the 
critical relationship is “pure,” and can
be generalized or used in a wide variety
of situations. Practically, this simply
means specifically including some and 
excluding others of those aspects of the 
total test situation which could be sus- 
pected of having an appreciable influ- 
ence on the relationship under examina- 
tion. 

Of the two types of experimental ap- 
proaches in which the test concept of 
ability has been used, one may be called 
the functional approach. The experi- 
mental aim in this case is to determine 
how well a group of persons adapt to 
(i.e., solve correctly) a test situation. The 
attempt is made to construct the test 
items so that the responses elicited will 
constitute a psychologically unambigu- 
ous relationship between the purposing 
members of the group and the goal de- 
fined by the test. In practically all cases 
the relationship of correct adaptation 
(e.g., overcoming a specified kind of diffi- 
culty) is qualified by a group standard 
rather than by an individual standard. 
In conformance with the defining opera- 
tions, the significance of this behavioris- 
tic ability for other situations less arti- 
ficial than the test situation is a matter 
to be established later. As a possible 
method of scientific investigation, this 
functional approach is basically empiri- 
cal and is best represented by the work 
of Thorndike, whose procedures will be 
examined in a later section. 

The other type of experimental ap- 
proach which uses the test concept of 
ability may be called the analytical ap- 
proach. It seeks the elementary com- 
ponents or factors which may be said 
to constitute the functional test ability. 
For some investigators the object is to 
discover psychological components or 
elements in the test ability, but the
preliminary defining operations are
exclusively mathematical. According to
Thurstone, who is an outstanding exponent
of this approach, statistical “consistency
among supposed indices of a
trait justifies the postulating of the trait”
(59, 102). And on the next page: “If
consistency is discovered in a group of
indices whose unity has not hitherto
been known, then we are justified in
postulating a new trait and in naming
it.” While this approach is basically
statistical, as distinguished from empirical,
it aims also at achieving scientific con- 
clusions, in this case concerning the ele- 
ments of human abilities. In Thurstone’s 
words: “As soon as some of the primary 
abilities have been isolated, detailed 
studies of inheritance should be undertaken”
(60, 52). Accordingly, Thurstone’s
work will also be appraised in a
later section. 


SCIENTIFIC THEORIES OF ABILITY 


The significance of a theory to scien- 
tific research was developed in an earlier 
chapter. Briefly the importance of a sci- 
entific theory may be summarized in the 
following functions: (1) It includes all 
known facts and laws in the field of 
inquiry, giving a unified explanation and 
control of current knowledge. (2) These 
known laws and generalizations can all 
be logically deduced from the postulates 
of the theory, a fact which systematically 
rules out alternative explanations and 
enhances our confidence in the validity 
of any individual law. (3) From these 
same postulates can be deduced a host 
of unverified hypotheses, which give di- 
rection and profound significance to fur- 
ther experimentation—that is, the future 
tenability of the theory depends upon 
their verification. 

This last function is the most im- 
portant of the three, because the pres- 
ence of a number of verifiable but un- 
verified hypotheses is really the test of a 
good scientific theory. If no hypotheses
are deducible from a set of postulates,
even though these postulates may be es- 
tablished facts, there is no theory. If the 
hypotheses deduced are capable of only 
logical demonstration and not experi- 
mental verification, then the theory is 
insofar not scientific, but metaphysical or 
hypothetical. If the hypotheses deduced 
are falsified in experiment, then the 
theory too is false. But if the theory 
continues to provide verifiable hypothe- 
ses which continue to be verified, it is 
accepted as sound and scientifically true. 

Until recently, the overwhelming ma- 
jority of scientific theories, especially in 
the physical sciences, were of the type 
known as atomic (40, 179). That is, a 
law describing some aspect of a theo- 
retical system was credited with holding 
just the same whether that aspect was 
considered independently or in conjunc- 
tion with other aspects of the system. For 
example, it was once believed that 
Boyle’s law describing the pressure ex- 
erted by a gas in a chamber at a specified 
temperature held for each individual gas 
molecule independently of the number 
of gas molecules in the chamber. Thus, 
the total pressure observed would be the 
simple additive effect of all the gas mole- 
cules in the chamber. More recently the 
discovery of an effect called the Brownian 
movement at extremely low pressures 
has modified this view, and the accepted 
interpretation is that Boyle’s law of gas 
pressure is definitely a function of the 
mass of molecules acting in interdepend- 
ent but non-additive fashion. The name 
for this latter type of system is organic, 
or, especially in psychological questions, 
organismic. Since the advent of the rela- 
tivity doctrine, there has been more of 
a tendency to build some scientific theo- 
ries in physics around organic systems in 
order to account for all the facts. The 
trend from atomistic to organismic theories
in psychology is undoubtedly well
known to the reader. The important 
feature of the organic system is that its 
parts act differently in isolation from 
the way they act as members of the 
system. Obviously this latter type of sys- 
tem is much more difficult to investigate 
scientifically than is the atomic type of 
system, chiefly because scientific controls 
are so much more difficult to obtain in 
an interdependent system. 
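
The contrast between the two types of system may be put in a
small sketch. It is offered only as an illustration: the two
functions and the numerical values are hypothetical, and the
product rule is merely the simplest non-additive rule at hand,
not a model drawn from any of the sources cited.

```python
# Illustrative only: both functions are hypothetical stand-ins,
# not models taken from the sources cited in the text.

def atomic_system(parts):
    """Additive system: the whole is the sum of independent parts."""
    return sum(parts)


def organic_system(parts):
    """Interdependent system: each part's effect depends on the others.

    A product is used as the simplest non-additive rule.
    """
    total = 1.0
    for p in parts:
        total *= p
    return total


def effect_of_raising_first(system, parts, step=1.0):
    """Change in the system's output when the first part rises by `step`."""
    raised = [parts[0] + step] + list(parts[1:])
    return system(raised) - system(parts)


# Atomic case: the first part has the same effect whatever the
# second part happens to be.
atomic_low = effect_of_raising_first(atomic_system, [2.0, 1.0])
atomic_high = effect_of_raising_first(atomic_system, [2.0, 5.0])

# Organic case: the same change in the first part has a different
# effect depending on the value of the second part.
organic_low = effect_of_raising_first(organic_system, [2.0, 1.0])
organic_high = effect_of_raising_first(organic_system, [2.0, 5.0])
```

In the additive case a part may be studied alone and the result
carried back into the system; in the product case it may not,
which is the difficulty of control described above.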

This introduction is sufficient for a 
preliminary consideration of the cur- 
rent theories associated with the men- 
tal test concept of human abilities. Al- 
though the theories vary considerably, 
they have a few significant characteristics
in common. One of their common
assumptions is that the ability in ques- 
tion is qualitatively the same for all 
persons, and that its variations between 
persons are consequently only quantita- 
tive. This assumption is applied not only 
to persons of the same age but also to 
persons on different age levels. Another 
common assumption is that the elements 
of a test ability, however they may be 
grouped, operate as an atomic system. 
This means, among other things, that 
the performance relationship between a 
person and the test is a sum of the addi- 
tive effects of whatever elements compose 
this expressed ability. Possibly because of 
the statistical difficulties involved, an or- 
ganic theory of test ability has not yet, to 
the writer’s knowledge, been formulated 
and proposed. 

There are at present not one but sev- 
eral rival theories of reputable standing 
to explain test abilities, simply because 
the work of deducing and verifying criti- 
cal hypotheses from each has not pro- 
ceeded far enough to establish any as the 
theory which correctly fits all the facts. 
One of the earliest held by mental testers 
is the unifactor theory. It posits a large
number of highly particularized components
which in aggregate form constitute,
to take the typical example, intelligence
or intellectual ability. This
type of theory was used by Thorndike 
(55, 415-21) in his experimental treat- 
ment of CAVD intellect. It has lately 
been criticized as being unable to ac- 
count for all known psychological facts, 
but it still has a champion in Thom- 
son (53). 

Another is Spearman’s two-factor 
theory, which proposed to account for 
any test ability as the additive product 
of a general factor, g, underlying all in- 
tellectual tasks, and one or more specific 
factors. Some consider this theory con- 
clusively established (64, 5-6), while 
others consider it effectively disproved 
(53). Probably neither view is completely 
correct, for such hypotheses as have been 
deduced from its postulates have not yet 
been capable of experimental verifica- 
tion, a fact which weakens the scientific 
value of this theory. 
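
The additive form of the two-factor theory can be sketched as
follows. The factor values, the equal loadings, and the function
names are all assumptions made for illustration; they are
fabricated so that the general and specific factors are exactly
uncorrelated, and they are not data from any study.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def two_factor_score(g, s, g_loading=1.0, s_loading=1.0):
    """Test ability as an additive product of a general and a specific factor."""
    return [g_loading * gi + s_loading * si for gi, si in zip(g, s)]


# Hypothetical factor values for four persons (assumed, not observed).
g = [-1.0, -1.0, 1.0, 1.0]      # general factor, shared by both tests
s1 = [-1.0, 1.0, -1.0, 1.0]     # specific factor of the first test
s2 = [-1.0, 1.0, 1.0, -1.0]     # specific factor of the second test

test_1 = two_factor_score(g, s1)
test_2 = two_factor_score(g, s2)

r_specifics = pearson(s1, s2)    # the specifics are uncorrelated
r_tests = pearson(test_1, test_2)  # the tests correlate through g alone
```

Under these assumptions the two tests correlate only because
they share g. The sketch runs from postulate to correlation; the
observed correlation by itself does not run back to the
postulate, which is why experimental verification is still
wanting.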

More recently this theory has been re- 
vised, chiefly by Spearman and Holz- 
inger, to include an intermediate ele- 
ment called a group factor. This theory 
still includes a general factor common 
to all test abilities and many specialized 
factors peculiar to particular perform- 
ances, but introduces a few group factors 
which are assumed to operate in several 
test abilities of a similar type. Some 
psychologists are proposing to include 
among these group factors such matters 
as the interest or zest of the testees in 
taking the test and their perseverance 
toward successful completion.* This
theory also remains merely logical or 
mathematical so far without critical ex- 
perimental verification. 

Finally, Kelley and Thurstone have 


* Attributed to W. P. Alexander by Wechsler 
(63, 10). 
advanced a theory that posits a limited
number of independent components 
which account for all test performances. 
These components or “primary abilities”
are commonly assumed to have zero cor- 
relation with each other and to combine 
in additive fashion to produce various 
kinds of test performances. This theory 
is currently the only one receiving any 
considerable experimental attention in 
an effort to verify directly or indirectly 
its postulates. 

In view of this preliminary survey of 
current proposals for scientific theories 
which, if verified, would make test re- 
sults a fruitful source of scientific laws 
or generalizations for psychology, it is 
now appropriate to look more carefully 
at mental testing procedures. How likely 
are they to provide the conditions for 
critical verification of any of these the- 
ories? This question may be answered 
by examining a few representative ex- 
amples of various treatments of testing 
procedures and results to see what scien- 
tific conclusions in regard to any of these 
theories are now warranted. 


SCIENTIFIC VALIDITY OF CURRENT 
TESTING PROCEDURES 


1. For first consideration is a method 
of mental testing proposed not so long 
ago by Kelley (22). The aim of this 
proposal is to achieve a means of quanti- 
fying or, at least, identifying differences 
in native intelligence between persons. 
It is conceded, of course, that test in- 
telligence per se is a product of both 
nature and nurture, but if these two 
influences are viewed as parts of an 
atomic system, it becomes conceivable 
that the influence of nurture may be 
eliminated or held constant. The main 
theoretical considerations then involved 
are as follows: 


“As there is no clear consensus of opinion
as to the meaning of native intelligence, we
are not taking any violent liberties with usage
if we ascribe to it the following properties:
(a) Except for growth, it does not change in
the individual as age changes. (b) Natural or
correct units for its measurement are those
which reveal this fact. (c) This immediately
suggests the experimental device of finding
the natural units of measurement by so determining
test units that the correlation between
early and late scores is a maximum” (22, 88).

The argument stated in this form is 
circular and is quite inadequate as a 
scientific theory. The basic postulate is 
that a person’s native intelligence
changes only in the sense of maturing in 
direct proportion to his age, probably 
ceasing to mature, however, at age 15 
or 16. In operational language, this 
means that a person’s native intelligence, 
during his early maturing years, main- 
tains a constant rank or standing to the 
native intelligences of others in that per- 
son’s age group. One of the hypotheses 
which can be deduced from this postu- 
late is that a true measure or index of 
native intelligence would produce this 
kind of results. According to scientific 
methodology, this hypothesis should be 
submitted to experimental verification 
under conditions where falsification is 
logically possible, and if verification is 
achieved, the validity of the original 
postulate is to that extent established. 
The important element in the verifica- 
tion of this hypothesis is having at hand 
a true measure or index of native in- 
telligence. Such a test is not available, so 
Kelley proposes to use the original postu- 
late, which was to have been submitted 
to indirect verification, as a standard for 
selecting test items which will give 
results in harmony with that postulate. 
Verification is thus forgone, and the
test results have no more scientific stand- 
ing than the original assumption. 
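
The circularity may be seen in a small sketch. The item names
and scores are invented, and the exhaustive search over item
subsets is only an assumed stand-in for whatever selection
routine an experimenter might actually use; it is not Kelley's
own computation.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def totals(scores, items):
    """Sum each person's scores over the chosen items."""
    n_persons = len(next(iter(scores.values())))
    return [sum(scores[i][p] for i in items) for p in range(n_persons)]


def select_items(early, late):
    """Pick the item subset whose early and late totals correlate most."""
    names = sorted(early)
    best_items, best_r = None, -2.0
    for k in range(1, len(names) + 1):
        for subset in combinations(names, k):
            r = pearson(totals(early, subset), totals(late, subset))
            if r > best_r:
                best_items, best_r = subset, r
    return best_items, best_r


# Hypothetical scores of four children on three candidate items,
# taken once early and once late (invented for illustration).
early = {"A": [1, 2, 3, 4], "B": [4, 1, 3, 2], "C": [2, 4, 1, 3]}
late = {"A": [2, 3, 4, 5], "B": [1, 4, 2, 3], "C": [3, 1, 4, 2]}

all_items = sorted(early)
r_all = pearson(totals(early, all_items), totals(late, all_items))
best_items, best_r = select_items(early, late)
```

Since the full set of items is itself one of the candidates, the
selected set can never show a lower early-late correlation than
the unselected test. The constancy of rank is therefore built
into the instrument by the selection and is not a finding that
the instrument could have contradicted.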

The only support that Kelley gives to
this assumption is by an appeal to the 
sensed-differences in intelligence which 
judges report between school children. 
When a scale based on a standardization 
of the sensed-differences as reported by 
the judges is compared with a scale based 
on test items altered to give maximum 
correlation with the original assumption, 
he finds that the differences on both 
scales in the variations from the mean 
at each level are very similar (22, 104-05). 
Even if this were valid verifying pro- 
cedure for science, it would concern only 
the total effective intelligence exhibited 
by the children and not their native 
intelligence. Notwithstanding the fact 
that Kelley once used reasoning similar 
to this in stating “that 97% of the adult 
difference between arithmetic reasoning 
and spelling is to be attributed to orig- 
inal nature” (21, 18), it is probable that 
most experimenters look upon this as- 
sumption, not as a scientific postulate, 
but as a value standard for practical test 
purposes. For example, one of the stand- 
ards of validity deliberately chosen in 
the latest revision of the Stanford-Binet 
scale is that acceptable test items should 
permit increasing percentages of chil- 
dren to pass any group of items as the 
age levels went up (50, 9). This standard 
is employed presumably to meet custom- 
ary purposes for classifying persons in 
schools and other institutions, and not 
because known scientific facts concerning
the nature of intelligence require
this particular standard. Thus, if there 
should come to prevail a different kind 
of purpose in school practices regarding 
the functional intelligences of children 
in specified situations—calling for, let us 
say, a horizontal classification of intelli- 
gence rather than the present vertical 
classification—a different test standard 
from the present one would undoubtedly 
be employed. 
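
The standard just described amounts to a simple rule for
accepting or rejecting items, which may be sketched as follows.
The function name and the pass rates are hypothetical, and the
requirement of a strict increase at every age is an assumption;
the published standard may well tolerate small reversals.

```python
def meets_age_standard(pass_rates_by_age):
    """True if the proportion passing rises at every successive age level.

    `pass_rates_by_age` maps an age level to the proportion of
    children at that age who pass the item group.
    """
    ages = sorted(pass_rates_by_age)
    rates = [pass_rates_by_age[a] for a in ages]
    return all(lo < hi for lo, hi in zip(rates, rates[1:]))


# Invented pass rates for two candidate item groups.
retained = {6: 0.20, 7: 0.45, 8: 0.70}
rejected = {6: 0.50, 7: 0.40, 8: 0.70}

retained_ok = meets_age_standard(retained)
rejected_ok = meets_age_standard(rejected)
```

The rule encodes a purpose of classification by age. A different
purpose would be served by substituting a different predicate,
with no change in any known fact about intelligence.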


2. As a second example of mental 
testing procedure designed to make valid 
contributions to a science of psychology, 
the work of Thorndike and his associates
in constructing the CAVD Test of
intellect will be considered. This work 
is monumental, not only because it was 
the first thorough attempt to achieve 
genuine measurement of intelligence but 
also because it makes explicit practically 
all the concepts, assumptions, and hypotheses
involved. Most other experimenters
are more inclined to leave the
task of unraveling the theoretical kinks 
and of giving precise meaning to con- 
cepts to the reader. The very thorough- 
ness of Thorndike’s description of his 
experimentation makes the few funda- 
mental fallacies in his work more ob- 
vious. 

Thorndike frankly begins his defini- 
tion of intellectual ability from value 
considerations commonly accepted in our 
culture. But the definition he is seeking 
is that of a specific test ability opera- 
tionally expressed. In his words: “An 
ability is defined by making a series of 
tasks such that the score in this total 
series depends on the ability. Thus an 
ability is defined by a total series of 
tasks” (55, 484). Although he gives us 
grounds elsewhere for suspecting that he 
conceives of this ability as only repre- 
sentative of a “deeper” ability behind 
the test performance and organically in 
the person, he is apparently taking the 
more defensible view here that CAVD 
intellectual ability is a critical relation- 
ship between the testee and the test situa- 
tion, expressed in terms of the tasks ac- 
complished. 

The definition of intellectual ability 
he finally reaches is a function of the 
difficulty in overcoming specified kinds 
of intellectual tasks. The concept of dif- 
ficulty used is a group concept based on 
the distribution of the number of tasks
accomplished by the members of the 
group. This concept was fully discussed 
in the preceding chapter. Thorndike 
qualifies this general group concept of 
difficulty to mean intellectual difficulty 
by introducing Aristotle as a symbol for 
the highest intellectual power. His group 
concept of intellectual difficulty thus be- 
comes: 

“Enough time being allowed for produc- 
tion so that an increase in time would not 
increase the number producing it, the differ- 
ence for Athenians of 40 is approximately 
greater the smaller the number of them who 
produce it, provided that the ranking of those 
who do produce it differs from the ranking 
of those who do not by greater nearness to 
the Aristotelian end” (55, 26). 

The next step taken is distinguishing 
between several dimensions of intellec- 
tual ability. One dimension is the width 
of intellect, conceived as proportionate 
to the percentage of successes with the 
sample of test items presented at a cer- 
tain level of difficulty. For purposes of 
direct comparison, Thorndike proposes 
that intellect A has a greater width than 
intellect B if A can do correctly all the 
tasks that B can do, and can also do one 
more task at the same difficulty level as 
the others. Another dimension is the 
altitude of intellect, conceived as pro- 
portionate to the degree of difficulty of 
the tasks accomplished. Again for pur- 
poses of direct comparison, intellect A 
has a higher altitude than intellect B if 
A can do correctly all the tasks that B 
can do save one, and in place of that one 
can do one that is more difficult accord- 
ing to the group standard. A third dimen- 
sion is the area of intellect, conceived as 
a function of width and altitude although
not in direct geometrical proportion.
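
The two direct comparisons may be restated as operations on sets
of tasks. The sketch below is one possible reading of the verbal
definitions, not Thorndike's own scoring procedure; the task
names, the difficulty levels, and the function names are
hypothetical.

```python
def width(passed, tasks_at_level):
    """Percentage of the tasks at one difficulty level that were passed."""
    tasks = set(tasks_at_level)
    return 100.0 * len(passed & tasks) / len(tasks)


def has_greater_width(a, b, difficulty):
    """A does everything B does, plus more at the same level as B's tasks."""
    if not b < a:  # B's tasks must be a proper subset of A's
        return False
    levels_of_b = {difficulty[t] for t in b}
    return all(difficulty[t] in levels_of_b for t in a - b)


def has_greater_altitude(a, b, difficulty):
    """A does all of B's tasks save one, and a harder one in its place."""
    dropped, gained = b - a, a - b
    if len(dropped) != 1 or len(gained) != 1:
        return False
    return difficulty[next(iter(gained))] > difficulty[next(iter(dropped))]


# Hypothetical tasks with group-defined difficulty levels.
difficulty = {"t1": 1, "t2": 1, "t3": 1, "t4": 2}

intellect_b = {"t1", "t2"}
intellect_wide = {"t1", "t2", "t3"}   # one more task at the same level
intellect_high = {"t1", "t4"}         # a harder task in place of t2

wider = has_greater_width(intellect_wide, intellect_b, difficulty)
higher = has_greater_altitude(intellect_high, intellect_b, difficulty)
b_width = width(intellect_b, ["t1", "t2", "t3"])
```

It should be noticed that the `difficulty` table on which both
comparisons depend is supplied by the group and not by either of
the persons compared.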

Here appears the first procedure that 
is scientifically inadmissible: a concept
of difficulty defined operationally by the
total group is applied to the performance
of individuals. The measurement of how
well a person can do is stated in terms of 
a value standard arbitrarily defined by 
the group, to a large extent regardless of 
each person’s motivations, particular ex- 
periences, and capacity for improvement. 
Such a standard does not permit a psy- 
chologically discriminating comparison 
between individuals. In the first place, 
attention is given in the scoring to only 
the number of items passed at each level 
of difficulty and not to what items were 
passed. Thus, when two persons are able 
to pass only 50% of the items at some 
level of difficulty (the usual criterion by 
which their altitude of intellect is de- 
termined), it is not only very probable 
that they have failed a number of differ- 
ent items at lower levels of difficulty but 
it is also conceivable that each has passed 
a different 50% of the items at this last 
level of difficulty. The assumption that 
these items are all comparable on the 
basis of sampling is not a demonstrable 
fact and has a more plausible basis social- 
ly than psychologically. Consequently, 
when two persons obtain the same score 
on a mental test as currently constructed, 
we do not know whether the two persons 
are comparable in the organic processes 
involved, or in the sociological influences 
on their development, or in native capacity,
or in adaptability to the particular
test situation; or whether there
are other possible bases of comparability. 
The prime prerequisite to discriminating 
comparison of individuals in any of these 
respects is the achievement of operational 
definitions of them for the individual 
person. Thus, by the principle of opera- 
tionism, scientific generalizations about 
an individual’s intellect based on Thorn- 
dike’s procedures are ruled out from this 
point on. 
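
A minimal sketch makes the point concrete. The item names and
the two records are invented; only the scoring rule, which counts
items passed without regard to which items they were, is taken
from the discussion above.

```python
def percent_passed(passed, items_at_level):
    """Score at one level: the share of items passed, ignoring which ones."""
    items = set(items_at_level)
    return 100.0 * len(passed & items) / len(items)


# Four hypothetical items at a single level of difficulty.
level_items = ["i1", "i2", "i3", "i4"]

person_x = {"i1", "i2"}   # passes the first half
person_y = {"i3", "i4"}   # passes the other half

score_x = percent_passed(person_x, level_items)
score_y = percent_passed(person_y, level_items)
shared = person_x & person_y
```

The two scores are identical although the two persons have not
one successful item in common, and the score alone gives no means
of telling this case from that of two persons who passed the very
same items.
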

In addition to his unifactor theory of
intelligence, described on a previous
page, Thorndike’s test procedures involve 
several other postulates concerning the 
nature of intelligence as a test ability. 
These include, of course, the usual as- 
sumptions that intelligence is qualita- 
tively the same for all testees, and that its 
variations between persons are conse- 
quently only quantitative and additive. 
But especially characteristic of Thorn- 
dike’s theoretical framework are two 
further postulates of fundamental sig- 
nificance: (1) that variations in the 
amount of any one person’s intellectual
ability over a controlled series of testings
are normally distributed according to the
curve of error; (2) that variations in the
amount of intellectual ability possessed 
by an unselected group of persons are 
normally distributed according to the 
probable frequency curve of a large num- 
ber of measured objects of a given kind. 
Not long ago many psychologists were 
prepared to accept these postulates as so 
very probably true that no specific veri- 
fication was needed, but Thorndike at- 
tempts to prove them for the intellectual 
ability he is testing. 

In regard to his proof of the first 
postulate concerning variations in any 
one person’s intellectual ability, it 
should first be remembered that he is 
applying a group concept of intellectual 
difficulty to the individual case, and it 
should be no surprise that he fails to 
achieve verification of this first postulate. 
In brief outline, his method of demon- 
stration runs as follows: Any one per- 
son’s variations in score on a single test 
do not form a normal distribution be- 
cause of the practice effect in repeating 
the test. When any one person is given 
eight different intelligence tests and the 
extent of his deviation from the group 
mean in each case is plotted, many per-
sons show something approaching a nor-
mal distribution of deviations around their
own average deviation. In order to do this,
of course, the person’s raw scores on the 
tests, being obviously incomparable, must 
be translated into a derived unit express- 
ing his deviation from the group mean. 
The units Thorndike uses are tenths of 
sigma units, which are derived from the 
properties of the normal curve, but still 
a close fit to the normal curve has not 
been obtained. However, after each per- 
son’s deviations from the group mean 
are plotted in tenths of sigma units 
around his average deviation from the 
group mean, the data for all persons in 
the group are combined and plotted from 
a common mean which represents any 
person’s average deviation. The resulting
graph, which represents the “average 
person” in the group, is a reasonably 
close approximation of the normal curve. 
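
The steps of this demonstration can be followed more easily in a small computation. The sketch below is only an illustration under stated assumptions: the scores are simulated, not Thorndike's, and the names `tenths_of_sigma` and `pooled_deviations` are hypothetical labels introduced here. It converts each test's raw scores to tenths of sigma from the group mean, centers each person's eight derived scores on his own average deviation, and pools the centered values for the whole group.

```python
import random
from statistics import mean, pstdev

def tenths_of_sigma(raw_by_test):
    """Express every raw score as a deviation from its test's group mean,
    in tenths of that test's standard deviation."""
    derived = []
    for scores in raw_by_test:
        m, s = mean(scores), pstdev(scores)
        derived.append([10 * (x - m) / s for x in scores])
    return derived

def pooled_deviations(raw_by_test):
    """Center each person's derived scores on that person's own average
    deviation, then pool the centered values of all persons."""
    derived = tenths_of_sigma(raw_by_test)
    n_persons = len(raw_by_test[0])
    pooled = []
    for p in range(n_persons):
        own = [test[p] for test in derived]
        own_average = mean(own)
        pooled.extend(d - own_average for d in own)
    return pooled

# Assumption: simulated scores for 200 persons on 8 tests with unlike raw scales.
random.seed(0)
n_persons, n_tests = 200, 8
standing = [random.gauss(0, 1) for _ in range(n_persons)]
raw_by_test = []
for t in range(n_tests):
    center, spread = 40 + 10 * t, 5 + t  # hypothetical scale of each test
    raw_by_test.append([center + spread * (a + random.gauss(0, 0.5)) for a in standing])

pooled = pooled_deviations(raw_by_test)
print(len(pooled), round(pstdev(pooled), 2))
```

Whatever shape the pooled values take, they describe the composite for the group; no single person contributes more than eight of them.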

As a method of scientific proof, this 
procedure can be criticized from several 
angles. First, we may well question the 
homogeneity of the property supposedly 
measured by the eight tests. The only 
property we are sure these eight tests 
had in common, outside of such super- 
ficial aspects as similar scoring methods, 
is that each is intended to test the same 
intelligence as the other tests. But each 
has its own kind and proportion of those 
activities generally considered as intel- 
lectual, and what they actually test is 
not scientifically known or operationally 
defined as the same thing. Moreover, 
since no one knows what qualitative as 
well as quantitative effects each of these 
tests had on the individual taking them, 
it is quite possible that these eight tests 
changed each child eight times in sig- 
nificant but unknown ways. There is no 
assurance that the eighth test was not 
testing a very different, “test-wise” ability
from that of the first test. And if all the 
children in the control group did not
take the eight tests in the same order 
(this was not reported), the status of the 
results for scientific treatment is even 
more chaotic. Use of normal curve units 
cannot help this situation, for although 
each child’s average may be in statisti- 
cally equal units, there is every likelihood 
that the existential content of the raw
scores from each test is not only un-
equal but qualitatively different.

Second, the shift from raw scores of 
intellectual tasks accomplished to sigma 
scores of relative standing in the group 
was necessary to make the results of the 
eight tests comparable, but this changes 
the question from the variations in the 
number of intellectual tasks a person can do
to the variations in that person’s relative stand-
ing in a group. The discussion in the
preceding chapter demonstrated that 
sigma units along the normal curve base 
line had no existential reference unless 
the original scores were equal units of 
some continuous property. The original 
raw scores made on the eight tests clearly 
do not represent equal units nor even 
approximate amounts of an operation- 
ally defined continuous property. If we 
are not sure of the equality of our units, 
or of what they stand for, our ignorance 
or uncertainty is not corrected by merely 
submitting the data to statistical treat- 
ment. Hence, the attempt to establish 
the normal distribution of several expres- 
sions of a person’s test ability by this 
procedure falls far short of scientific 
validity. 

Further criticisms of Thorndike’s 
method on this point are scarcely neces- 
sary and serve chiefly to illustrate the 
methodological difficulties he encounters 
as a consequence of this false start. The 
final step in the attempt at verification 
merely establishes the normal distribu-
tion of deviations from the mean for the
“average person” in the group, a fact
which might have scientific meaning for 
the group as a whole but not for any 
particular individual in the group. The 
normal distribution which Thorndike 
succeeds in approximating with his data 
is not the normal curve of error usually 
resulting from repeated measurements 
of the same thing, as appears to be his 
assumption, but is very probably the nor- 
mal probability curve which is fulfilled 
when the frequency of occurrence of a 
series of similar events (e.g., passing test 
items) is due to a large number of inde- 
pendent factors equally free to operate. 
Some of these factors might well be such 
things as individual motives, differences 
in background among testees, variations 
in the meaning of the test instructions, 
the large number of testees included, the 
method of scoring the test, the number 
of tests taken, and the conversion of raw 
scores into sigma scores before the nor- 
mality of the distribution was established. 
However this may be, the main point is 
that Thorndike’s first postulate has not 
been proved, and consequently there is 
no scientific basis for its wide use as a 
corrective device in his later procedures. 

His proof of the second postulate, 
that the quantitative differences in the 
amount of intellectual ability possessed
by the members of an unselected group 
also follow closely the normal curve, in- 
volves the same fallacies as the above 
case. No proof at all is offered that the 
ability is qualitatively the same in all 
the testees, though varying quantitative- 
ly. This is assumed, and all attention is 
given to finding the “true” nature of 
the distribution of this assumed prop- 
erty. He notes that on any one test the 
scores of the testees tend to follow vari- 
ous rough approximations of the normal 
curve. However, the tests he uses have 
not been scaled for equal difficulty or 
equal increments of difficulty in this
ability, and their common property is, 
as before, each author’s idea of what 
constitutes a reasonable range of intel- 
lectually difficult tasks. To correct for 
the admitted inequalities in the units of 
any one test, Thorndike proposes to 
eliminate chance inequalities by combin- 
ing the performances of the group on 
eleven separate intelligence tests. 

Again he is faced with the problem of 
making the scores comparable, a mani- 
fest impossibility directly. Since the dis- 
tributions he has obtained by unequal 
units of “intellect” on each test by itself 
seem to suggest the normal curve more 
than any other, he apparently assumes 
they probably should be normal and 
converts the scores of each distribution 
into one-tenth standard deviations from 
the mean. Since the normal curve is not 
based in this case on an operationally 
defined continuum of intellect or of its 
units of measurement, the mathematical 
unit of one-tenth sigma of course has 
no existential content or reference. With 
these units of “pure form” he combines
the eleven distributions and finds to no- 
body’s surprise that the composite dis- 
tribution looks more like a normal curve 
than any of the original distributions. 
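
Why the composite should look more normal than its parts can be suggested by a short simulation. The sketch below rests on loud assumptions: the eleven "tests" are simulated, strongly skewed, and mutually independent (real tests taken by the same persons would be correlated, which weakens the effect), and the helper names are hypothetical. It shows that averaging standardized scores tends to pull a distribution toward symmetry, whatever the scores refer to.

```python
import random
from statistics import mean, pstdev

def standardize(xs):
    """Convert scores to sigma units measured from their own mean."""
    m, s = mean(xs), pstdev(xs)
    return [(x - m) / s for x in xs]

def skewness(xs):
    """Third standardized moment; zero for a symmetrical distribution."""
    return mean(z ** 3 for z in standardize(xs))

random.seed(1)
n_persons, n_tests = 2000, 11

# Assumption: each simulated test yields independent, heavily skewed scores.
tests = [[random.expovariate(1.0) for _ in range(n_persons)] for _ in range(n_tests)]

sigma_scores = [standardize(t) for t in tests]
composite = [mean(t[p] for t in sigma_scores) for p in range(n_persons)]

single_skews = [skewness(t) for t in tests]
composite_skew = skewness(composite)
print([round(s, 2) for s in single_skews], round(composite_skew, 2))
```

Under these assumptions the composite's skewness should come out markedly smaller than that of any single test. The improvement in shape is a property of the summation, not evidence about the thing summed.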

Neither of the two fundamental postu- 
lates in Thorndike’s theory has been 
verified in any scientific sense. The proof 
consisted essentially of taking out of the 
procedures what was originally read into 
them. The normal distribution of the 
intelligence of an unselected population 
remains merely an assumption. As an- 
other leader in intelligence testing ob- 
serves, “This may or may not be true. 
There are biological characters for which 
it is not true, and intelligence may con-
ceivably be one of them” (50, 24-25).

From this point on, however, Thorn-
dike uses these two postulates as verified
laws by which the results of his CAVD
test may be legitimately altered and 
“corrected.” As one illustration, when 
Thorndike finds that his test data fit 
the normal curve just fairly well, he cor- 
rects the original test “units” so that the 
results will fit the curve more perfectly 
(55, 226). This procedure is intended to 
obtain a “truer” set of units, but from 
the standpoint of science it is an attempt 
to conjure knowledge out of ignorance. 
As another illustration, Thorndike ap- 
pears to assume that the achievement of 
a single normal curve proves the homo- 
geneity of the ability, while the discovery 
of two or more curves proves that the 
results are non-homogeneous (55, 422-
25, 371, Ch. VI passim). This assumption
was specifically used in “proving” the 
homogeneity in the range of intellect 
CAVD between progressive age groups. 
We should hardly need to repeat that 
hypothetical deductions of this sort pre- 
sent no scientific proofs but merely hy- 
potheses to be experimentally verified. 
Homogeneity can be assured only by 
establishing beforehand by operational 
definition that the characteristic in ques- 
tion is common to all the members. 
Thorndike’s methods have been ex- 
amined to discover whether they repre- 
sent a genuinely scientific contribution 
to the theory and knowledge of human 
abilities. We have found that they failed 
on two scores: (1) He was obliged to 
assume the very bases which needed to 
be verified in order to make his theory 
scientifically valid. (2) By relying on the 
“fact” of the normal distribution of in- 
tellectual ability after a totally inade- 
quate proof, he corrects his experimental 
data to fit the curve, and thus removes 
all possibility of ever verifying directly 
or indirectly his original assumptions. 
His approach, consequently, offers little 
of significance to a science of psychology, 
and the conclusions he draws from this
approach should not be interpreted as 
scientific generalizations. However, this 
appraisal should not be construed as a 
denial of much practical usefulness in 
his work for many specific situations 
which call for evaluational comparisons 
between persons in terms of a group 
standard of intellectual behavior. 
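
The circularity of correcting units so that results fit the normal curve can be made concrete. The sketch below uses a generic rank-based normalizing transformation as a stand-in; it is not Thorndike's actual computation, the data are simulated, and `normalize_units` is a hypothetical name. Two sets of scores with very different shapes are "corrected," and the corrected values turn out to be identical, since the normal form is supplied by the procedure itself.

```python
import random
from statistics import NormalDist

def normalize_units(scores):
    """Replace each score by the normal deviate belonging to its rank
    position. The result is normal by construction."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    corrected = [0.0] * n
    for rank, i in enumerate(order):
        corrected[i] = NormalDist().inv_cdf((rank + 0.5) / n)
    return corrected

random.seed(2)
skewed = [random.expovariate(1.0) for _ in range(500)]  # simulated, long right tail
flat = [random.uniform(0, 100) for _ in range(500)]     # simulated, rectangular

corrected_skewed = sorted(normalize_units(skewed))
corrected_flat = sorted(normalize_units(flat))
print(corrected_skewed == corrected_flat)
```

Since every set of scores emerges with the same distribution, the fit to the curve after correction can confirm nothing about the distribution of the property before it.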

3. As a third and last example of
mental testing procedures which are de- 
signed to make valid contributions to a 
science of psychology, a recent move- 
ment called factor analysis will be con- 
sidered. Earlier in this chapter a func- 
tional approach and an analytical ap- 
proach to test abilities were described 
and distinguished. The analytical ap- 
proach, it will be recalled, seeks certain 
elementary components which may be 
said to constitute, describe, or “account 
for” the functional test ability. The 
search for these components is commonly 
termed factor analysis. The chief prog- 
ress made by this movement in recent 
years has been in the development of 
more refined and more elaborate statisti- 
cal techniques, but this discussion will 
focus on the scientific validity of its 
methodology rather than on its mathe- 
matics. 

Factor analysis starts with the inter- 
correlations of a group’s performances 
on different tests. Instead of building 
one test or a few tests, as Thorndike 
did, experimenters in factor analysis em- 
ploy a great many tests, sometimes as 
large a number as sixty. The chief pur- 
pose of the analysis is to find the small- 
est number of statistical components or 
factors to account for the intercorrela- 
tions between these many tests. In prac- 
tically every case factors are sought which 
are independent (i.e., statistically un- 
correlated), although this requirement is 
admitted as a convenience rather than
as a statistical necessity (56, 222-28).
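
A small numerical case may help fix what "accounting for" intercorrelations means. The text above does not say which extraction method is meant, so the sketch below adopts one simple procedure, the centroid method, purely as an assumption. The correlation table is hypothetical, and unities are placed in the diagonal for simplicity, which is itself a debatable choice. The first factor's loadings are taken from the column sums, and the residual table shows what that one factor leaves unexplained.

```python
from math import sqrt

def first_centroid_factor(R):
    """Return the first centroid loadings of correlation table R and the
    residual correlations left after that factor is removed."""
    n = len(R)
    column_sums = [sum(R[i][j] for i in range(n)) for j in range(n)]
    total = sum(column_sums)
    loadings = [c / sqrt(total) for c in column_sums]
    residuals = [[R[i][j] - loadings[i] * loadings[j] for j in range(n)]
                 for i in range(n)]
    return loadings, residuals

# Hypothetical intercorrelations among four tests (unities in the diagonal).
R = [
    [1.00, 0.60, 0.50, 0.40],
    [0.60, 1.00, 0.45, 0.35],
    [0.50, 0.45, 1.00, 0.30],
    [0.40, 0.35, 0.30, 1.00],
]

loadings, residuals = first_centroid_factor(R)
print([round(a, 3) for a in loadings])
for row in residuals:
    print([round(r, 3) for r in row])
```

If the residuals are judged negligible, one factor is said to suffice; if not, further factors are drawn from the residual table. The judgment of what is negligible is not supplied by the arithmetic.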

Of considerable significance is the fact 
that several experts in this field disclaim 
any existential status for factors in a 
science of psychology. According to 
Thomson: “Factors are statistical coef- 
ficients, changing with the sample and 
the conditions and dependent upon 
stated assumptions: but with defined con- 
ditions and assumptions they are most 
useful as descriptive terms” (52, 77). And 
Burt is even more emphatic in his con- 
viction that “a factor is primarily a prin- 
ciple of classification and nothing more. 
. . . The temptation to identify an ab- 
stract and hypothetical factor, reached 
by mere mathematical analysis, with a 
real and concrete ‘factor in the mind’ 
(e.g., with an ‘ability’) is simply a new 
instance of the old problem that has 
divided nominalists, realists, and con- 
ceptualists for centuries” (6, 84-85). The 
soundness of these views is given strong 
support by the fact that there is no com- 
monly accepted, unique method of fac- 
torizing a set of intercorrelations. Many 
methods are possible, each based on dif- 
ferent kinds of assumptions, and no 
method has been conclusively proved su- 
perior to another. However, this need 
not imply that factor analysis is fruitless. 
On the contrary, if a few factors, how- 
ever they be reached, can give as much 
useful information as a large number of 
tests, there is an obvious economy in their 
use. 
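
The absence of a unique factorization can be shown directly. In the sketch below the loadings are hypothetical and the helper names are introduced only for illustration. A table of loadings on two uncorrelated factors is rotated through an arbitrary angle; the rotated loadings differ from the originals, yet both reproduce exactly the same intercorrelations.

```python
from math import cos, sin, radians

def reproduce(loadings):
    """Correlations implied by loadings on uncorrelated factors:
    the sum of cross-products of each pair of rows."""
    n = len(loadings)
    return [[sum(a * b for a, b in zip(loadings[i], loadings[j]))
             for j in range(n)] for i in range(n)]

def rotate(loadings, degrees):
    """Rotate a two-factor loading table through the given angle."""
    t = radians(degrees)
    return [[x * cos(t) + y * sin(t), -x * sin(t) + y * cos(t)]
            for x, y in loadings]

# Hypothetical loadings of four tests on two factors.
A = [[0.8, 0.1], [0.7, 0.2], [0.2, 0.7], [0.1, 0.8]]
B = rotate(A, 30)

largest_gap = max(abs(p - q)
                  for row_a, row_b in zip(reproduce(A), reproduce(B))
                  for p, q in zip(row_a, row_b))
print([[round(v, 3) for v in row] for row in B], largest_gap)
```

Any angle would serve as well as thirty degrees. The correlations alone therefore cannot decide among the infinitely many sets of factors which fit them equally well.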

But other experts in this field are seri- 
ously trying to find existential, scientific 
referents for their factors. Stephenson 
(47, 94-104), working in the Spearman 
tradition, is impatient with the view 
that factors are merely mathematical 
methods of classification. He is seeking 
factors which he has already tentatively 
named and proposes building new and 
broader tests to verify his hypotheses. 
Thurstone, the outstanding American in-
vestigator in factor analysis, foresees a
similar scientific significance for these fac- 
tors, although he has named his factors 
differently and uses a different method of 
factorizing. A few years ago Thurstone 
expressed confidence that his factors not 
only would replace such indices of gen- 
eral intelligence as the IQ and mental 
age (58, 133) but would also make pos- 
sible detailed studies of inheritance (60, 
52). More recently he has indicated that 
factor analysis is primarily on the border- 
line of science, suggesting hypotheses 
about mental abilities which could be 
rationalized and submitted later to veri- 
fication (56, 189). Since Thurstone is the 
only investigator who has taken concrete 
steps toward verification of his hypoth-
eses, his work can be considered here
as representative of what may be ex- 
pected of scientific significance from 
factor analysis. 

The bases of Thurstone’s attempts to 
conduct scientific experimentation with 
factor analysis are a mixture of shrewd 
empirical guesses and eclectic statistical 
methods. As the most economical start- 
ing point, he tentatively assumes that the 
“real” factors may be independent, sta-
tistically uncorrelated. He holds that in
factor analysis it is unnecessary to make 
any assumptions about the nature of the 
factors—consequently, they may be physi- 
cal, psychical, chemical, social, native, or 
acquired. Their particular nature will 
have to be determined later by some 
experimental test. Regardless of their 
particular nature, he believes it prob- 
able that there are a great multitude of 
determiners and that factors are func- 
tional groupings of these determiners. 
However, he is rather convinced that the 
true mental factors, when found, will op-
erate in arithmetical proportion in pro-
ducing various performances, i.e., that
they are members of an atomic system
rather than an organic system. In consist- 
ency with the logic of these assumptions, 
he takes as a basic principle that the 
true factorial composition of a test will 
remain invariant when the test is trans- 
ferred between batteries containing the 
same common factors and also when it 
is given in a battery to different but 
comparable populations. This principle 
goes much further than the significance 
which Thomson, as quoted on a pre- 
ceding page, attached to factors. 

The particular statistical method of 
factorizing used is chosen for its promise 
in producing something that can be veri- 
fied as having existential meaning. Form- 
erly Thurstone used a method which 
continued to evolve factors until all the 
correlation residuals were within the 
errors of sampling, but this seemed too 
arbitrary to be successfully rationalized. 
Recently he has employed the principle 
of “simple structure,” a method which
maximizes the number of zero correla- 
tions and the size of the few significant 
correlations so that certain test perform- 
ances can be very largely described by 
perhaps only one or two factors. When 
he finds certain groups of tests heavily 
saturated with chiefly one factor, he is 
in a position to make shrewd guesses 
as to the existential nature of that factor. 
For example, high correlations between 
certain arithmetic tests led him to hy- 
pothecate a number factor. So far Thur- 
stone’s statistical methods have produced 
twelve factors, all about equally large, 
but he has been able to suggest names 
or existential meaning for only seven of 
them. The remaining five have not been 
sufficiently isolated in his present tests 
to make plausible hypotheses about their 
nature possible. 
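
The idea behind simple structure can be suggested, very crudely, by a search over rotations. The sketch below is not Thurstone's procedure: it is a toy stand-in that assumes only two factors, uses hypothetical loadings, and adopts an arbitrary tolerance for what counts as a "zero" loading. It turns the axes one degree at a time and keeps the position in which the greatest number of loadings are near zero.

```python
from math import cos, sin, radians

def rotate(loadings, degrees):
    """Rotate a two-factor loading table through the given angle."""
    t = radians(degrees)
    return [[x * cos(t) + y * sin(t), -x * sin(t) + y * cos(t)]
            for x, y in loadings]

def near_zero_count(loadings, tol=0.15):
    """Crude simplicity criterion: the number of loadings close to zero.
    The tolerance is an arbitrary assumption."""
    return sum(1 for row in loadings for value in row if abs(value) < tol)

# Hypothetical unrotated loadings: every test loads on both factors.
unrotated = [[0.57, -0.57], [0.57, -0.42], [0.57, 0.42], [0.57, 0.57]]

best_angle = max(range(0, 180),
                 key=lambda d: near_zero_count(rotate(unrotated, d)))
rotated = rotate(unrotated, best_angle)
print(best_angle, [[round(v, 2) for v in row] for row in rotated])
```

In the rotated position each test is described chiefly by one factor, which is what invites a name for that factor. The naming remains a guess; the rotation supplies only the pattern.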

The scientific value of this type of 
factor analysis awaits the verification 
of the hypotheses involved. Factor analy-
sis is not yet equipped with a comprehen-
sive scientific theory from which these
hypotheses are logically deduced. As was 
noted above, Thurstone grants that 
factor analysis so far is merely a fruitful 
source of surmises, some of which may 
be verified and then rationally formu- 
lated. Consequently, present attempts at 
scientific verification of the nature of 
these factors follow a cut-and-fit method. 
When certain factors found in one test 
battery appear to represent clearly dis- 
tinguishable psychological processes, a 
new battery is constructed including 
tests which are designed to represent the 
assumed nature of these factors as strong-
ly as possible. Then the results of the
new battery are factorially analyzed to 
see whether certain tests especially con- 
structed for one factor actually show 
very high intercorrelations and very low 
correlations with other groups of tests. 
The results of recent attempts to verify 
Thurstone’s current guesses about the 
nature of seven of his factors have not 
been particularly encouraging. The 
American Council on Education has pub- 
lished an experimental battery of sixteen 
tests designed to reveal these seven inde- 
pendent factors or “primary mental 
abilities.” The few experimental uses of 
the battery that have been reported 
(10; 46) show rather similar results—the 
factor scores which should be independ- 
ent of each other correlate too high (in 
the Crawford report almost half had an 
intercorrelation of over .40), and the 
tests which should be highly saturated 
with the same factor correlate too low 
with each other (Stalnaker reports a 
range from .20 to .79 with a central point
of .49). Thurstone’s indirect answer to 
these implied criticisms has been that 
the simplification required for a battery 
of only sixteen relatively brief tests has
introduced many errors leading to inter-
correlations between separate primary
abilities, and that the low correlations be-
tween tests of the same factor merely
indicate considerable differences in the
factor saturation of each test (56, 232). 

Although the above discussion has sug- 
gested that factor analysis of test per- 
formances has not proceeded to the point 
where its significance for a science of 
psychology can be decisively determined, 
there are considerable grounds for doubt 
that a scientific identification and de- 
scription of primary mental abilities will 
be achieved by this treatment of test re- 
sults even in the long run. The first and 
least consequential ground for doubt 
is the current incompleteness in certain 
of the experimental controls. In particu- 
lar, there is so little control which can 
now be exercised over the subjective mat- 
ter of interpreting the existential mean- 
ing of factors and selecting the most 
representative tests* that the trial-and- 
error checking of every possible inter- 
pretation looks like an endless job. Even 
if only psychological interpretations of 
these factors were possible, this point 
would still carry much weight. But, as 
all factor analysts admit, these factors 
could have sociological, environmental, 
or purely local or “accidental” meaning 
as easily as psychological meaning.

*See Quinn McNemar’s review of Thurstone’s
Primary Mental Abilities (29).

A second ground for doubt is that the
postulates and assumptions commonly
used in factor analysis depend upon the
test results’ providing more and better
information than they actually do. For
example, one assumption is that each
factor bears a linear relation to the test
scores. This demands that a score on a
test must correspond to a person’s ability
to do that kind of task. But in the pre-
ceding chapter we found no scientific
certainty that raw test scores represent
any particular amount of difficulty over- 
come (i.e., ability with this type of diffi- 
culty), and no verification that difficulty 
(or ability to overcome it) is a linear 
continuum varying as the test scores. An- 
other assumption is either that each per- 
son possesses an ability-component on 
an all-or-none basis or that every normal 
person possesses some amount of each 
primary ability. If the former, it means 
that any factor presumed to be psycho- 
logical in nature, if not possessed by each 
person in the group, is at least homo- 
geneous among those persons whose 
test scores indicate they possess it. If the 
latter, it means that, while the factors 
(or primary abilities) may enter into the 
accomplishment of different tasks in dif- 
ferent degrees, the weighting or loading 
of the factors required for achievement 
in any group of similar tasks is qualita- 
tively and proportionately the same for 
all testees, who differ only in their rela- 
tive ranks in this proportion. In either 
case, the assumptions require tests which 
present the same kind of difficulty to all 
testees or, in more familiar terms, de- 
mand the same kind (and only that 
kind) of ability from each testee. But 
again this condition was found in the 
preceding chapter to be one of the major 
uncertainties, for scientific purposes, of 
current mental tests. 
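
The linear assumption discussed above may be written out. The notation is an assumption adopted here for illustration and is not taken from the works cited: z_{ji} is the standard score of person i on test j, F_{pi} is that person's amount of common factor p, a_{jp} is the loading of test j on factor p, and U_{ji} with weight u_j is the part peculiar to test j. With uncorrelated factors in standard form, the correlation between two tests follows from the loadings alone.

```latex
z_{ji} = a_{j1}F_{1i} + a_{j2}F_{2i} + \cdots + a_{jm}F_{mi} + u_{j}U_{ji},
\qquad
r_{jk} = \sum_{p=1}^{m} a_{jp}\,a_{kp} \quad (j \neq k).
```

The loadings a_{jp} carry no subscript for the person. This is the formal expression of the requirement that the factors enter into a given task in the same proportion for every testee.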

A third ground for doubt concerning 
the eventual scientific significance of 
factor analysis of test results is also a 
consequence of the insufficient scientific 
information provided by the tests. In this 
instance, the statistical techniques of 
factor analysis impose logical conditions 
which are inadequately fulfilled by test 
scores. For example, the normal distribu- 
tion of test scores commonly sought at 
present is required to be a normal dis-
tribution of amounts of ability. Further-
more, it is not enough that persons be
subject only to ranking in regard to the 
amount of ability they possess. Their 
standings must be capable of differentia- 
tion by equal units, expressed in “z” 
scores or standard scores, so that the 
distribution of standings on several tests 
may be combined or meaningfully cor- 
related. Subsequent procedures of re- 
volving the axes and factoring out the 
components depend upon fulfilling the 
conditions of these first basic steps. How- 
ever, a considerable section of Chapter 
IV was devoted to the make-believe 
quality of such interpretations of a nor- 
mal distribution of test scores. Since the 
base line of the normal curve in this 
case is not known to be either a con- 
tinuum of ability or a continuum of 
equal units, the logical equality of stand- 
ard scores derived from sigma has no 
reference to existential amounts. Thus,
the logic of inquiry pursued by factor 
analysis is broken at the roots. 
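
The dependence of standard scores on the equality of the original units can be seen in a small case. The sketch below uses hypothetical raw scores and helper names. One test's scores are restated in a different but order-preserving unit (here, simply squared). Every person keeps the same rank, yet the standard scores and the correlation with a second test both change.

```python
from statistics import mean, pstdev

def z_scores(xs):
    """Standard scores: deviations from the mean in sigma units."""
    m, s = mean(xs), pstdev(xs)
    return [(x - m) / s for x in xs]

def ranks(xs):
    """Rank position of each score (the data below contain no ties)."""
    order = sorted(xs)
    return [order.index(x) for x in xs]

def pearson(xs, ys):
    """Product-moment correlation as the mean product of standard scores."""
    return mean(a * b for a, b in zip(z_scores(xs), z_scores(ys)))

# Hypothetical raw scores of eight persons on two tests.
test_a = [2, 4, 5, 7, 9, 12, 15, 20]
test_b = [3, 6, 4, 9, 8, 14, 13, 25]

# The same performances on test A restated in another, order-preserving unit.
restated_a = [x ** 2 for x in test_a]

r_original = pearson(test_a, test_b)
r_restated = pearson(restated_a, test_b)
print(ranks(test_a) == ranks(restated_a), round(r_original, 3), round(r_restated, 3))
```

If the tests warrant only a ranking, nothing in the data favors one of these scales over the other. The correlations from which the factors are drawn are to that extent products of the unit chosen.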

The nature of these obstacles to the 
scientific treatment of test results by 
factor analysis gives further support to 
the conclusion, reached twice before in 
this chapter, that mental tests have not 
served, and are not very likely to serve, 
as means for verifying any of the cur- 
rently proposed scientific theories on the 
nature of human abilities. This, of 
course, does not reflect upon factor an- 
alysis as a technique of investigation. 
Applied to other psychological data than 
that obtained from mental testing, it may
ultimately make a far greater contribu-
tion to the science of psychology than
appears possible from its present applica- 
tion. But by the criteria of science, the 
methods and results of mental testing 
appear incapable of making a contribu- 
tion to the scientific problems of psy- 
chology. 
CHAPTER VI


MENTAL TESTS AND PSYCHOLOGICAL ISSUES 


UP TO this point the significance of
mental testing has been qualified
as follows: (1) As scientific instruments 
for quantifying amounts of difficulty 
(and, hence, amounts of human abilities), 
mental tests provided results susceptible 
only to very rough ranking techniques, 
and these results were too ambiguous for 
rigorous logical treatment. (2) None of 
the theories on the nature of psychologi- 
cal abilities has been decisively verified, 
even partially, through mental testing 
procedures, and there do not appear to 
be grounds for much hope in this di- 
rection. As a matter of fact, current 
theories of intelligence and of other hu-
man abilities are not productive of hy-
potheses which appear capable of verifi-
cation through mental testing proced-
ures.

These limitations of the significance 
of mental testing should not obscure the 
fact that the movement has made at 
least one important contribution to a 
genuine science of psychology. It arose 
at a time when the standard approach 
to psychological phenomena was intro- 
spection, and its immediate success in 
practical matters of classifying school 
children stimulated a widespread adop- 
tion of the behavioristic approach to 
psychological problems. This latter ap- 
proach now characterizes the point of 
view of most significant types of psycho- 
logical experimentation. Although it now 
appears quite possible that mental test- 
ing will be displaced in large part by 
better techniques rather than itself un-
dergo refinement for basic scientific pur-
poses, mental testing has served an im-
portant transitional function in the
search for methods which will establish 
a fundamental science of psychology. 

The above limitations should also not 
obscure the fact that the great services of 
mental testing undoubtedly lie largely 
in the field of pragmatic evaluation of 
human behavior for a vast range of pur- 
poses. Its many uses in guidance, in pre- 
dicting success in school, in identifying 
mental defectives, in classifying persons 
for various jobs, and the like, are widely 
recognized and are usually sound. The 
effectiveness of tests for these purposes 
will be enhanced if the ambiguous tend-
ency to attach pseudo-scientific signifi-
cance to certain test results is decisively
resisted. For this tendency the test build- 
ers themselves are partly but not wholly 
to blame. Accordingly, the work of this 
chapter will be to examine the logic of 
some of the supposedly scientific generali- 
zations currently made on the basis of 
test results. 


THE NATURE-NURTURE ISSUE 


One of the oldest and most important 
questions which has concerned psychol- 
ogy is the relative kinds and amounts of 
influence exerted by heredity and en- 
vironment upon various types of human 
behavior. This question has not yet been 
met by a systematic scientific theory of 
interlocking postulates and logically de- 
ducible hypotheses which, if verified, 
would give predictive control over the 
individual case. Current speculations on 
the topic differ even as to whether an 
atomic or an organic system is the more 
appropriate kind of theory. Support for 
the present conflicting views rests largely
upon intimations and interpretations.
Although the question still awaits 
scientific attack and solution, consider- 
able recourse has been made to mental 
testing results in the past few decades 
for support to the views that either 
heredity or environment has prepotent
influence in determining one’s intelli- 
gence. Without conclusive facts being 
available on the question, it has been 
possible for reputable psychologists and 
educators to dispute, occasionally with 
some heat, the relative importance of 
heredity and environment. The motiva- 
tion to the dispute is understandable be- 
cause of the differing educational conse- 
quences following from either view, but 
of special interest is the very widely held 
assumption that mental test results hold 
the key to which side is right. In fact, 
two yearbooks (34, 35) of the National 
Society for the Study of Education are 
landmarks in the elaborate application 
of intelligence tests to the problem of 
nature and nurture. 

On what grounds can the results of 
intelligence testing be considered perti- 
nent to this scientific problem? Admitted- 
ly we know nothing about the genetic 
composition of so-called human abilities. 
Consequently, we know nothing about 
how the environment and these genetic 
bases of ability interact. For lack of any 
scientific knowledge, the common sense 
view of the matter is to assume that the 
effects of heredity are whatever is left 
over after the effects of environment 
have been taken out or equalized. Thus, 
if two children of the same age reared 
in similar families with equal care and 
with approximately equal opportunities 
are given an arithmetic test or a spelling 
test, the difference in their performances 
is to be attributed largely to hereditary 
influences. Then if the weaker child is
given special training in arithmetic or
spelling so that on a later test he equals
or exceeds what the other can then do,
that improvement is to be attributed to
environmental influences. If then the
originally stronger child is given the
same amount (common sense seldom 
bothers to define this too well) of special 
training, the difference between their 
performances on the third application of 
the test is again attributable to heredi- 
tary influences. 

Of course, there are several objections 
which psychologists raise to this common 
sense view. For one, the observed differ- 
ence between their first performances 
may be due, not to innate ability, but to 
different rates of mental development in 
these skills. The changed relationship of 
their second performances might there- 
fore be the result of a temporary or pos- 
sibly permanent stimulation of the first 
child’s rate of development in this field. 
Without further knowledge on this point, 
the results of the third test would be 
open to any one of several conflicting 
interpretations. 

A second objection to this use of arith- 
metic or spelling tests follows from the 
fact that hereditary influences are here 
viewed as purely a function of the diffi-
culty of learning, becoming negligible
when learning is easy and overwhelming 
when learning is very hard. In this case 
arithmetic and spelling tests involve ma- 
terial too easily subject to slight differ- 
ences in learning opportunities. On the 
other hand, a very hard test, such as 
comparing the metaphysical assump- 
tions of Plato and Kant, would probably 
not be greatly influenced by slight differ- 
ences in learning opportunities but 
would be greatly dependent on the very 
unlikely fact that everyone in the group 
had read these two philosophers. In 
short, the basis of this typical objection 
is that innate mental ability would be
best expressed in a test consisting of 
items which no one could answer better 
by living in a different home, or by 
working at a different job, or by going 
to a different school, and yet with which 
everyone in this culture had approxi- 
mately equal opportunity to become fa- 
miliar. This requirement immediately 
rules out tests like the usual ones in 
arithmetic, spelling, and philosophy. 
Now what is different about an intelli- 
gence test which supposedly makes it 
applicable to the nature-nurture problem 
while ordinary arithmetic and spelling 
tests are not? As near as one can gather, 
the line of argument seems to run as 
follows. There appears to be a wide- 
spread conviction that heredity is the 
exclusive determiner of something called 
innate capacity. This innate capacity is 
fixed by the genes. It cannot be ex- 
panded or contracted. Given a reason- 
able chance, it can only grow and mature. 
In the growth of the child, innate ca- 
pacity for intelligence and the various 
environmental stimulants to intelligence 
of course interact, producing that func- 
tional aspect of behavior which we call 
intelligence. It is further assumed that 
the interactions of nature and nurture 
take place, not in an organic system, but 
in an atomic system. A person’s effective 
intelligence at any given time is a whole, 
but it is some sort of additive whole. 
Some part of the total intelligence at a 
given age is preponderantly attributable 
to the normal growth of native capacity. 
The remaining part is preponderantly 
attributable to the special effects of pe- 
culiar environmental influences. In other 
words, that part of intelligence which is 
developed by the normal group experi- 
ences of growing up is taken to be the 
revelation of that person’s inherent, 
fixed capacity for development in this 
culture. That part of intelligence which 
is developed by unusual or atypical ex- 
periences or which is rather easily sub- 
ject to special training is taken to be the 
“extras” which everybody acquires in 
various amounts according to the acci- 
dent of circumstances. 

Certain special characteristics are 
claimed for intelligence tests which 
would seem to commend them particu- 
larly to this view. In the first place, the 
special kinds of mental activities which 
are placed in current intelligence tests 
are considered by many experts to be 
extremely likely though indirect indices 
of innate mental ability. The tests may 
not be measuring devices in the strictly 
scientific sense, but at least they can be 
considered sampling devices, the scores 
on which indicate how many intellectual 
tasks (out of a representative sample of 
all types of intellectual tasks) each per- 
son can accomplish in comparison with 
other persons in his group. Much of the 
confidence in tests of this nature seems 
to be based on the fact that the tasks in 
the test are so carefully chosen that, out- 
side of direct practice on them or their 
equivalent, they are not made easier to 
accomplish by any of the normal experi- 
ences of living except that of growing up 
or maturing. Since this is not so true of 
tests like those in arithmetic and spell- 
ing, they are not considered to be such 
valid indices of innate mental ability. In 
the second place, when a person’s rank, 
as expressed by the age group which on 
the average makes his score, is divided 
by the intellectual rank expressed by his 
actual age (and multiplied by 100), the 
resulting IQ rank for that person within 
his group has been found to remain rea- 
sonably constant from childhood to ma- 
turity, thus suggesting a direct correla- 
tion with the presumed nature of the 
growth of innate capacity. 
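
The computation described in the second point can be sketched as follows. This is only an illustration of the ratio just stated (the age level whose average score the person matches, divided by his actual age, times 100); the function name and the sample ages are hypothetical and do not come from any particular test manual.

```python
def ratio_iq(mental_age_months: float, chronological_age_months: float) -> float:
    """Ratio IQ: mental age divided by chronological age, multiplied by 100."""
    if chronological_age_months <= 0:
        raise ValueError("chronological age must be positive")
    return 100.0 * mental_age_months / chronological_age_months


# Hypothetical child, aged 8 years (96 months), whose score equals the
# average score of 10-year-olds (120 months).
iq_at_8 = ratio_iq(120, 96)

# The same child at 12 years (144 months), now scoring like the average
# 15-year-old (180 months): the ratio, and hence the IQ, is unchanged.
iq_at_12 = ratio_iq(180, 144)

print(iq_at_8, iq_at_12)  # 125.0 125.0
```

A "constant IQ" in this sense means only that the ratio of the two ages stays the same as both increase; the arithmetic itself says nothing about the source of that constancy.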

If these bases seem inadequate to the 
reader for scientific studies of the influ- 
ences of nature and nurture, let us look 
at some of the claims being made from 
the results of experiments using intelli- 
gence tests. On the basis of test results 
up to 1928, expert opinion tended to 
agree with the conclusions of investigat- 
ors like Terman and Burks, who felt able 
to assign as much as 75 to 80 percent of 
the influence on a person’s intellectual 
status to heredity (65, 505). This conclu- 
sion is still fairly common. Leahy (24, 235-303), 
in an unusually elaborate study re- 
ported in 1935, felt justified in assigning 
to environment only 4 percent of the 
credit for differences in the IQ’s of her 
subjects, and Thorndike (54, 320) in 1940 
concluded that the factors producing 
differences in intelligence between per- 
sons should be allotted as follows: to the 
genes, 80 percent; to training, 17 per- 
cent; to accidental factors like measure- 
ment errors, 3 percent. However, there 
is a growing group who have come to 
different conclusions. In the light of 
fairly recent test results, many reputable 
investigators, notably those from Iowa 
State University, are giving a much 
greater influence to environment. In 
view of the conflicting contentions, the 
safe position today seems to be that the 
facts are inconclusive and can be inter- 
preted to support a nature bias or a nur- 
ture bias about equally well. But since, 
in the light of the preceding chapters of 
this study, this scientific question would 
not even be capable of solution by pres- 
ent mental testing procedures, the bases 
of the current argument deserve some 
examination. The bases may be classified 
under three heads: (1) the likelihood 
that present intelligence tests are indirect 
indices of innate ability; (2) the ade- 
quacy of the controls over nature and 
nurture; (3) the constancy of the IQ. 


INTELLIGENCE TESTS AND INNATE 
ABILITY 

No one claims that an intelligence test 
reveals inherited intellectual capacity ex- 
clusively. The performance on the test is 
obviously a product of hereditary and 
environmental influences, however they 
may operate. Yet the argument of many 
investigators is that the effects of nature 
and nurture are potentially distinguish- 
able in the performance of certain test 
activities under controlled conditions. 
The selection, arrangement, and stand- 
ardization of these certain activities are 
designed to provide generous opportuni- 
ties for innate mental ability to be re- 
sponsible for a significant share of the 
performance elicited. This selection, ar- 
rangement, and standardization of test 
items, however, also involves unescap- 
ably certain assumptions concerning 
what the test builders consider to be the 
nature of innate ability. The extent to 
which present intelligence tests can be 
considered indices of innate mental abil- 
ity is proportionate to the extent to which 
these assumptions about innate ability 
are capable of the direct or indirect veri- 
fication required by science. Conse- 
quently, the most important of these as- 
sumptions will be examined in this light. 

The first assumption is that the 
amount of adventitious information and 
skill with abstractions which a person 
acquires in normal living is due in sig- 
nificant part to the amount of innate 
capacity for intelligence which he pos- 
sesses. This common sense point of view 
has wide currency in our culture, but 
since we have no direct or independent 
knowledge of innate ability the relation- 
ship implied in this assumption is not 
yet capable of any discriminating veri- 
fication. 

A second assumption, concerning all 
tests which provide an MA or an 
IQ, is that the intelligence produced by 
native capacity and learning is a gener- 
alized factor, participating in part and 
without qualitative change in such di- 
verse activities as analogies, vocabulary 
matching, similarities and differences, 
verbal completion, and memory for 
digits. This also is neither verified nor 
verifiable at present. In fact, Thurstone’s 
work in factor analysis is proceeding on 
the quite different but also unverified 
assumption that the mind is composed 
of a number of primary, uncorrelated 
abilities. If Thurstone should happen to 
be right, existing IQ computations might 
continue their standing as a generalized 
social standard of desired intellectual 
behavior, but scientifically they would 
be meaningless “averages” of unknown 
amounts of discrete mental abilities. 

The third assumption commonly made 
by test builders concerning the nature of 
innate ability is that it grows and de- 
velops without qualitative change in all 
persons from year to year, varying only 
in quantity or amount between persons 
and between ages until mental adult- 
hood is reached. On the basis of this as- 
sumption, considerable effort is made to 
select test items which appear to require 
the same kind of mental exercise on dif- 
ferent age levels. In actual practice this 
is an unrealized ideal, for it is obvious 
to casual inspection that some of the 
mental functions called for between 
widely separated age levels are very prob- 
ably different to a considerable degree. 
Thus, this assumption is not only un- 
attainable in practice to a significant ex- 
tent, but is also unverifiable in current 
testing procedures. 

A fourth assumption is that innate in- 
telligence matures regularly and smooth- 
ly for most persons up to a certain point. 
If pupils should show on a test an ir- 
regular increase in intelligence with in- 
creasing age, that would be taken as evi- 
dence that the performance being tested 
was the result of different opportunities 
for learning rather than of innate intel- 
ligence (8, 226). Consequently, part of 
the process of standardizing the test for 
validity is to obtain items which will give 
this regular increase with age (50, 38-41). 
Since the assumption is employed to 
make the test fit the assumption, the im- 
possibility of independently verifying the 
assumption is apparently recognized and 
conceded. 

The foregoing assumptions all reflect 
a common sense but scientifically arbi- 
trary conception of the characteristics of 
innate ability. This is the kind of intel- 
lectual ability which is sampled by pres- 
ent intelligence tests. But whether this 
conception is something more fundamen- 
tal than a currently practical, common 
sense view is not known nor so far verifi- 
able. These grounds are therefore hardly 
adequate for a purportedly scientific 
study of the influence of heredity on the 
intelligence of persons. The great desid- 
eratum at this point is verifiable knowl- 
edge of the dimensions of innate capacity 
and how it develops to maturity. Perhaps 
Courtis was nearer the truth when he 
said: 

“It is a fact that for six children out of ten 
the IQ can be determined from the number 
of teeth cut as well as from a Binet test. In- 
deed, it is my belief that if we only kept in- 
dividual records of the number of teeth cut 
each year, we should be able to determine 
IQ’s better from such records than from Binet 
tests. We have quite erroneously, I believe, 
interpreted the IQ as a measure of capacity 
when it is really nothing more than a meas- 
ure of relative rate of development” (9, 409). 


THE ADEQUACY OF EXPERIMENTAL 
CONTROLS OVER NATURE AND 
NURTURE 

In view of the conclusions of the fore- 
going section, we cannot expect current 
intelligence tests to provide an experi- 
mental control over the innate aspects of 
the intelligence expressed in the test. 
Undoubtedly innate capacity contrib- 
uted in some fashion to the perform- 
ance, but we have no verifiable grounds 
for using the assumptions of intelligence 
test construction in defining the nature 
and dimensions of native ability. Hence, 
we must look elsewhere for controls over 
the effects of nature. 

The effects of nature may be permitted 
to vary, it is assumed, if the effects of 
nurture can be controlled in the con- 
struction and administration of the in- 
telligence test. The specific problem is 
to exclude all environmental influences 
not equally shared by the members of 
the group. One type of control is sought 
through the standardization of the test 
norms, while another type is sought in 
connection with the administration of 
the test. A consideration of both types is 
pertinent to the problem at hand. 

An important characteristic of all 
types of control exercised over environ- 
mental influences is that environment is 
interpreted almost entirely in the socio- 
logical sense and very little in the psy- 
chological sense. For example, if two 
testees are from the same grade in school, 
come from Anglo-Saxon stock, and have 
grown up on farms next door to each 
other, they have presumably been subject 
to very similar environmental influences. 
This view does not take account, how- 
ever, of such probable facts as (1) psy- 
chologically different reactions of indi- 
viduals to a sociologically similar back- 
ground, and (2) psychologically different 
reactions of individuals to the challenge 
or appeal of the artificial test situation. 
Considerable effort is usually made to 
take care of this second fact by securing 
rapport before the test begins and 
rigidly standardizing the test administra- 
tion, but the problem is acknowledged 
by test experts as still being unsatisfac- 
torily solved (56, 233). The only consid- 
eration usually given to the first fact 
is either to ignore it or to assume that 
any psychologically different reactions 
which occur are functions of innate 
ability. 

Some test builders frankly recognize 
these difficulties but attempt to circum- 
vent them on practical grounds in the 
construction of test norms. For dealing 
with individual persons, these difficulties 
can be acknowledged as part of the ex- 
perimental error in intelligence testing, 
but for a large, standardized group of 
persons, these sources of error may be 
expected to cancel out by chance, at 
least to a large extent. Attention can 
then be given to the selection of a group 
of persons whose social backgrounds will 
be representative of the various social 
backgrounds found in this country. The 
fulfillment of this condition, it is argued, 
makes the test applicable to the largest 
number of people. One of the most diffi- 
cult jobs of test standardization is to find 
groups of persons of the same range of 
social background on each age level. 

The first criticism to be made of this 
procedure is the inadequacy of the kinds 
of social environment selected to be rep- 
resented in the standardizing group. 
Test builders have been aware of this 
deficiency, and later revisions of intelli- 
gence tests have been more broadly rep- 
resentative of various geographical 
areas in the country and of the popula- 
tion distribution between urban and 
rural areas. But in almost no case has 
systematic account been taken of socio- 
economic status or occupational rank as 
an environmental influence on the devel- 
opment of test intelligence (36, 752-57). 
Since socio-economic status undoubtedly 
has considerable to do with opportunity 
for education, cultural stimulation, and 
intellectual attitude, the well known cor- 
relation between social status and intel- 
ligence test scores is probably to a large 
extent a correlation of “A” with “AB.” 
Moreover, when socio-economic status is 
uncontrolled in the standardizing group, 
as has been the practice, that group it- 
self will be differentiated to some un- 
known extent according to social status 
in the proficiency the subjects show at 
many tasks (vocabulary, information, 
knowledge of relationships, etc.), and 
this distorting element will be perpet- 
uated in all subsequent applications of 
the test. Only the last revision of Ter- 
man’s Group Test has used a fairly repre- 
sentative sample of the social levels in 
our society. 

The second criticism to be directed at 
this method of standardizing environ- 
mental influences is more fundamental. 
It is an accepted principle of test con- 
struction that the scoring norms are valid 
only for persons or groups subject to 
similar environmental influences. We 
have noted that the test which has been 
standardized on the most widely repre- 
sentative social backgrounds, the latest 
revision of the Terman Group Test, has 
included groups from West, East, North, 
and South, from urban and rural areas, 
and from various socio-economic levels. 
Now presumably each group from a sig- 
nificantly different social background 
would have a unique norm, characteris- 
tic of the unique set of environmental in- 
fluences which played somewhat equally 
upon the development of the members in 
that group. Yet the norm is made single 
for all groups, representing, for example, 
the average proficiency of persons from 
all the socio-economic groups. Rather 
than equalizing the environmental influ- 
ences, this is merely standardizing the 
environmental inequalities between the 
socio-economic groups. To date, the con- 
struction and standardization of intelli- 
gence tests have given a wholly inade- 
quate equalizing control over environ- 
mental influences. 
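
The effect of a single norm can be made concrete with a small sketch. All figures below are invented for illustration, and the simple "percent of norm" measure is an assumption adopted for the example, not the scoring rule of any actual test. Two hypothetical socio-economic groups of the same age have different average raw scores; a child who is exactly typical of his own group is rated as average against that group's norm but as well below average against the pooled norm.

```python
from statistics import mean

# Hypothetical raw scores for one age level in two socio-economic groups.
raw_scores = {
    "group_a": [38, 42, 46],
    "group_b": [50, 54, 58],
}

# A separate norm for each group, and the single pooled norm actually used.
group_norms = {name: mean(scores) for name, scores in raw_scores.items()}
pooled_norm = mean(score for scores in raw_scores.values() for score in scores)


def relative_standing(score: float, norm: float) -> float:
    """Illustrative index only: a score expressed as a percentage of the norm."""
    return 100 * score / norm


typical_a = 42  # a child scoring exactly at the average of group_a
print(relative_standing(typical_a, group_norms["group_a"]))  # 100.0
print(relative_standing(typical_a, pooled_norm))             # 87.5
```

The gap between the two figures is not an error that averaging removes; it is the difference between the groups carried over into every score read against the pooled norm.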

Some experimenters have attempted to 
meet the problem of environmental con- 
trol by carefully equating the conditions 
surrounding the administration of the 
intelligence test. The best of these at- 
tempts are represented in studies of fos- 
ter children and in studies of twins. 

In studies of foster children and foster 
parents compared with children reared 
with their natural parents, a selected 
group of families is matched according 
to amount of education, occupational 
classification, residential district, and 
similar factors which are associated with 
or could be associated with different 
ranks on an intelligence test. Effort is 
made to obtain large enough groups to 
cancel out by chance the different psy- 
chological reactions which otherwise com- 
parable individuals might have toward 
their social background and toward the 
taking of the test. Much care is exerted 
to eliminate appreciable selective factors 
in the placement of foster children. In 
short, all the environmental control 
which appears reasonably possible to ob- 
tain is sought, and hereditary influences 
are left free to vary as they will. 

What knowledge do we now gain con- 
cerning the influence of native factors 
upon intelligence? The plain facts are 
that fairly large groups of parents and 
their children are known to correlate 
with each other in IQ rank around .50, 
while the correlation between foster 
parents and foster children is found to 
be somewhat lower, around .20. This 
certainly indicates that, in terms of rank 
orders on the intelligence test, children 
are more like their natural parents than 
foster children are like their foster par- 
ents. The difference between an r of .50 
and an r of .20 is large enough to be 
statistically significant, so one can safely 
conclude that the major factor responsi- 
ble for the difference is probably heredi- 
tary influence. Whether that difference 
can be accurately quantified into a per- 
centage of hereditary influence depends 
on the extent to which the differentiation 
between persons provided by the intel- 
ligence test is actually a differentiation 
in native capacity. The grounds for as- 
suming that the differentiation indicated 
by intelligence testing is one of native 
capacity were examined in the preceding 
section and found wanting. Thus, as far 
as any scientific justification is con- 
cerned, there are no decisive grounds 
against repeating the above experiment 
with arithmetic or spelling tests instead 
of intelligence tests, and, if the correla- 
tion is again higher between children 
and their natural parents, giving the 
credit again to heredity. All that is really 
verified in the above experiment is that 
heredity accounts for some part of the 
greater similarity in test intelligence be- 
tween children and their parents over 
foster children and foster parents. Psy- 
chologists with the strongest bias toward 
environmental influences should be will- 
ing to grant this. 
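
The statement that an r of .50 differs significantly from an r of .20 can be checked with a short sketch. It uses Fisher's z transformation for two independent correlations, which is one standard way of making the comparison and is an assumption here, since the text does not say which test was applied. The sample sizes of 200 pairs in each group are likewise hypothetical; the result depends on them.

```python
import math


def compare_independent_correlations(r1: float, n1: int, r2: float, n2: int):
    """Compare two independent correlations using Fisher's z transformation.

    Returns the z statistic and the two-sided p-value.
    """
    z1 = math.atanh(r1)  # Fisher transformation of the first r
    z2 = math.atanh(r2)  # Fisher transformation of the second r
    standard_error = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / standard_error
    p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_two_sided


# r = .50 for natural parents and children, r = .20 for foster parents and
# foster children; 200 pairs per group is an assumed figure.
z, p = compare_independent_correlations(0.50, 200, 0.20, 200)
print(f"z = {z:.2f}, two-sided p = {p:.4f}")
```

With groups of this assumed size the difference is well beyond chance. The test establishes only that the two correlations differ; it gives no warrant for converting the difference into a percentage of hereditary influence.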

However, questions of major concern 
to the environmentalist in this situation 
remain unanswered. This approach 
throws no light on whether foster homes 
had beneficial or deleterious influence on 
the IQ’s of foster children, nor even 
whether the various kinds of homes of 
natural parents had beneficial or dele- 
terious influence on the IQ’s of their 
children. Consequently, this approach 
does not show what factors might have 
been back of any possible increase or de- 
crease in the intellectual ability of the 
children, either on the part of foster 
parents or of natural parents. But we 
should add that, even if this information 
had been forthcoming, it could not legit- 
imately be interpreted as evidence that 
favorable and unfavorable environments 
actually alter native capacity. 

In studies which compare the resem- 
blance between identical twins and the 
resemblance between fraternal twins in 
terms of rank orders on an intelligence 
test, virtually the same situation prevails 
as above. A greater resemblance is found 
between identical twins than between 
fraternal twins, indicating the operation 
of hereditary factors, but again the con- 
struction of the intelligence test does not 
give scientific grounds for assigning a 
specific amount or percentage to hered- 
ity. Moreover, the environmentalist 
again has cause to object that the possi- 
ble influence on the IQ of controlled 
changes in the environment is not re- 
vealed in this approach. 

In experiments set up to study the 
difference between identical twins reared 
together and identical twins reared apart, 
the intelligence test is asked to play a 
different role. In this case the hereditary 
influences are equalized by a fortuitous 
accident. Since identical twins are the 
product of the division of a single ovum 
and inherit virtually identical physical 
characteristics, they can be safely as- 
sumed to inherit identical mental capa- 
city. Furthermore, identical twins reared 
together afford naturally a better equali- 
zation of environmental influences than 
any yet devised. Any differences between 
them on an intelligence test (or, for that 
matter, an arithmetic or spelling test, 
provided they have really shared all ex- 
periences related to these activities) 
would have to be charged to weaknesses 
in the construction or administration of 
the test. 

For identical twins reared apart, the 
variable is environmental influence, and 
the intelligence test thus becomes one in- 
dex of the effect of different environmen- 
tal influences on their effective intelli- 
gence. Interest now focuses on how much 
change in IQ can be accomplished by 
rearing identical twins separately (1) in 
various kinds of different environments 
(2) starting at different ages and (3) 
starting at different IQ levels. Those who 
feel that the IQ is a relatively unchang- 
ing ratio for each person and those who 
feel that the IQ is greatly modifiable 
under certain circumstances can get to- 
gether here and settle their differences by 
an appeal to the facts. The only caution 
we should urge is that even these experi- 
ments are not determining the relative 
influences of nature and nurture upon 
test intelligence. Rather they indicate 
how much an intelligence produced by 
nature and nurture up to a certain age 
can be altered in IQ points by various 
changes in the subsequent environment. 

Up to this point we have examined 
two assumptions underlying the pre- 
sumed applicability of intelligence test- 
ing to the nature-nurture problem. The 
assumption that intelligence tests were 
somehow indices of innate ability was 
found, by logical analysis, not to have 
those verifiable foundations required by 
science. In regard to the assumption that 
adequate controls existed over the influ- 
ences of nature and nurture, the prospect 
of adequate control over nature by the 
construction of the test was of course 
eliminated with the first assumption. 
The process of test standardization was 
also found to be an inadequate control 
of environmental influences for the na- 
ture-nurture problem. Considerable suc- 
cess was seen to be achieved in the ex- 
perimental control of the environmental 
conditions surrounding the administra- 
tion of an intelligence test in studies of 
foster children and of comparisons be- 
tween fraternal twins and identical 
twins, but the significance of this control 
for the nature-nurture problem was 
vitiated by the lack of scientific grounds 
for ascribing to native capacity any defi- 
nite aspect of the performance on the 
intelligence test. Additional control of 
hereditary influences was achieved in 
comparisons between identical twins 
reared together and identical twins 
reared apart, but in this case heredity 
was merely a constant, and again the in- 
ability of an intelligence test to provide 
a verifiable scale of innate capacity made 
conclusions on the kind and amount of 
influence exerted by nature and nurture 
impossible. 

The obvious keystone in the above 
conclusions is whether intelligence tests 
are indices of a distinguishable quality 
which can be designated as the effect of 
innate capacity. Some investigators may 
concede that they cannot use rigorous 
logic in giving scientific foundation to 
the assumption that intelligence tests do 
provide indices of native capacity, but 
this assumption still amounts to a convic- 
tion with them. The basis of this per- 
sistent conviction appears to be the re- 
markable constancy they have found in 
the IQ, and to this matter we shall now 
give attention. 


THE CONSTANCY OF THE IQ 


The plausibility of the contention that 
a relatively constant IQ is an indication 
of innate mental capacity depends on 
the acceptance of a certain theory regard- 
ing the way in which hereditary influ- 
ences operate. This theory holds that in- 
nate intelligence, determined by the 
genes, is a constant which no environ- 
ment can alter but which a favorable en- 
vironment can bring out somewhat more 
effectively. In the years up to early ma- 
turity this capacity develops and reveals 
its original potentialities. Moreover, 
while its growth in persons may not be 
at a constant rate in absolute terms, it is 
assumed to maintain a constant relation- 
ship between persons on any age level. 
Thus, brighter persons start out brighter 
than dull persons and maintain that dif- 
ferential of superiority during the years 
that native capacity is maturing and, 
presumably, throughout life. 

No one questions that an intelligence 
test is an index of some kind of intelli- 
gence. From its results, the IQ is com- 
puted to show a person’s intellectual 
standing in the cultural group of which 
he is a member. An IQ is an index of 
his relative status at that particular time. 
From the point of view of this theory, it 
may also be an index of his maturing 
innate capacity if at least two further 
conditions are fulfilled. First, the relative 
standing represented by the IQ must not 
change appreciably from year to year as 
the group matures in the same cultural 
environment. Second, even changing the 
environment for some of the group 
should not significantly alter their IQ's. 
On empirical grounds, then, the analogy 
between the behavior of the IQ and the 
assumed behavior of native capacity 
would be complete. One gathers from the 
current dispute over the constancy of 
the IQ as a clue to the nature-nurture 
problem that both the “hereditarians” 
and the “environmentalists” accept in 
common this analogical basis for inter- 
preting their test results. 

A word should be said at this point 
on the contention (41; 49) that the con- 
struction of intelligence tests stacks the 
cards in favor of a constant IQ. The 
argument against the contention is based 
chiefly on the fact that, while tests are 
designed to be reliable indices of present 
intellectual status in the group, there is 
no predetermination in the test construc- 
tion of what that intellectual status 
might be at a later time under different 
conditions. 

The most important argument in 
favor of the contention is based on the 
probability that the very process of ob- 
taining a reliable index at a given time 
makes the continuance of that rank order 
the normal and most likely expectation 
in subsequent testings. The activities 
constituting intelligence tests have been 
selected after a long trial-and-error pro- 
cess so that the normal experiences of 
growing up will actually increase the 
proficiency of all children in succeeding 
age groups in these activities. A corolla- 
tive requirement for these test activities 
is that proficiency in them should not be 
influenced by the usual range of pecu- 
liarities in the environmental background 
of the children—e.g., differences in home, 
school, or occupation. Then a high re- 
liability is sought for the test by repeat- 
ing it, at least in principle, within a brief 
time to see whether the rank order of the 
subjects remains just about the same. 

After this selective process, from what 
source could a change in IQ rank come? 
Presumably it could not come from quit- 
ting school or studying harder or going 
to live with one’s aunt, for great effort 
is made to select test items free from 
these peculiarities. It could not come 
from growing older and thus more pro- 
ficient with the test activities, for every- 
one else is growing older too. It is con- 
ceivable, of course, that each person 
could have a decidedly unique curve of 
maturation, but if that result should 
appear very strongly in the test results, 
doubts would probably be felt that the 
effect of differences in individual envi- 
ronments had been successfully ruled 
out. It could not come, in theory at least, 
from taking the test again, because only 
highly reliable tests are used. Unless 
one’s normal rate of development sud- 
denly lagged or accelerated, one could 
expect to occupy the same rank in his 
group next year and the year after that 
as he occupies now. Only two factors 
could conceivably accomplish a sudden 
lag or acceleration: (1) an unusual or- 
ganic disturbance, which would be ruled 
out as atypical for the group anyway; 
and (2) a radical change in environment, 
which is very difficult to achieve under 
present experimental controls. Thus, the 
critic charges that the process of intelli- 
gence test construction and standardiza- 
tion makes a constant IQ by far the most 
probable expectation. 

The chief significance of this criticism 
is that it definitely weakens the plausi- 
bility of the heredity hypothesis as the 
only likely explanation of the tendency 
for IQ’s to remain constant under stand- 
ard environmental conditions. But just 
how constant does the IQ tend to re- 
main in spite of this favorable expecta- 
tion? Even the best available facts on the 
constancy of the IQ within the environ- 
mental range standardized in the test 
norms show only a fair constancy for 
groups as a whole and considerable in- 
constancy for individuals. Under the 
very best conditions of individual test- 
ing, reports Goodenough (35, Pt. I, 358), 
half of the group falling around the mean 
for that age may be expected not to vary 
in IQ more than four to six points over 
a period of six to seven years, but the 
other half at either end of the distribu- 
tion will change status by more than six 
points, a few as much as twenty or more 
points. Moreover, if the usual group test 
is given or if a different test is used on 
the second examination, even greater 
variations in individual IQ’s will occur. 
Indeed, a study made by Robert L. 
Thorndike in three private progressive 
schools in New York City on over joo
children retested after an interval of at
least two and one half years showed that 
almost 13% gained 20 points or over in 
IQ while almost 4% lost 20 points or 
over (48, 53r). In 1926 Hollingworth 
stated: “Psychologists no longer doubt 
that it is now possible to predict when 
a child is six years old, what his relative 
position will be in the total range of 
intellects when he is sixteen” (68, 158). 
And yet in 1930 A. W. Brown reported 
data which showed that the predictive 
efficiency of the Stanford-Binet, while 
55% better than chance on retest in- 
tervals under a year, was only 20% better 
than chance on retest intervals from five 
to twelve years (36, 735). In the same 
vein, R. R. Brown in 1933 reported an 
average shift for the group of 12.7 IQ
points over a five to twelve year interval 
(36, 736). 
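
The phrase “better than chance” can be made concrete with a short sketch. It assumes, since the source does not state the formula, that the percentages quoted are values of the index of forecasting efficiency, E = 1 - sqrt(1 - r^2), which expresses a retest correlation r as the proportional reduction in the error of prediction. The function names are illustrative only.

```python
import math


def efficiency_from_r(r: float) -> float:
    """Forecasting efficiency E = 1 - sqrt(1 - r**2) for a correlation r."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation must lie between -1 and 1")
    return 1.0 - math.sqrt(1.0 - r * r)


def r_from_efficiency(e: float) -> float:
    """Size of the correlation implied by a forecasting efficiency e."""
    if not 0.0 <= e <= 1.0:
        raise ValueError("efficiency must lie between 0 and 1")
    return math.sqrt(1.0 - (1.0 - e) ** 2)


# Assumption: the 55% and 20% figures are forecasting efficiencies.
short_interval_r = r_from_efficiency(0.55)  # retest within a year
long_interval_r = r_from_efficiency(0.20)   # retest after five to twelve years

print(f"short interval: r is about {short_interval_r:.2f}")
print(f"long interval:  r is about {long_interval_r:.2f}")
```

On this assumption the 55% figure corresponds to a retest correlation of roughly .89 and the 20% figure to a correlation of .60, which suggests how far predictive value may fall even when the correlation itself still appears substantial.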

In view of the range represented in the 
available evidence, the constancy of in- 
dividual status (not group or average 
status) in test-intelligence over a considerable
period of years is a matter to be
empirically ascertained for each person 
rather than predicted from cogent 
grounds. The evidence warrants some 
prediction for the group or the average 
but does not warrant prediction of IQ 
rank for any one person over a long term 
except within a range so wide as to be only
a little better than chance. This conclu- 
sion contradicts the first condition for 
assuming, on analogical grounds, that an 
IQ may be a reliable index of a person’s 
maturing innate capacity. The second 
condition was that even unusual changes 
in the environment of part of the group 
should not significantly alter their IQ’s. 
What are the implications of the evi- 
dence on this point? 

One of the main problems seems to be 
what constitutes a significant change in 
the environment in regard to an ex- 
pected influence on intelligence. One 
expectation is that going without sleep
for eighty hours or fasting for three 
weeks would be a significant change, but 
the decrement in intellectual perform- 
ance under these conditions has been 
found to be rather small (49, 468). An- 
other expectation is that children who, 
at an early age, had as much access to 
usable environmental stimulation as 
some control group did but who then 
continued to live in a considerably more 
impoverished (or stimulating) environ- 
ment than the control group had would 
show a progressive decrease (or increase) 
in IQ rank up to some point. Various 
studies (36, 738-40) of canal boat chil- 
dren, gypsy children, children of eastern 
Kentucky mountaineers, children in a 
North Carolina mill town, and negroes 
migrating to New York show such an 
expectation to be justified. A third ex- 
pectation is that identical twins reared 
apart in homes of widely different cul- 
tural environments would diverge in IQ 
rank accordingly, and the most famous 
study of twins to date concludes that 
“extreme differences in educational and 
social environments are accompanied by 
significant changes in intelligence and 
educational achievements as measured by 
our tests” (37, 349). A fourth expectation 
is that attendance at nursery school on 
the part of pre-school children would in- 
crease their IQ rank in comparison to a 
control group. This expectation is still 
unsettled in current experimentation. A 
major doubt arises from the known fluc- 
tuations in test results for children below 
four years of age. Most of the studies 
reported show that the IQ’s of nursery 
school children go up, but in most cases 
so also have those of the carefully placed 
control group. 

A fifth expected influence on intelli- 
gence is the socio-economic level of the 
home in which the child grows up. Many 
studies have shown that different socio-economic
levels have different means in
IQ, but the question of the constancy of 
the IQ has not been systematically 
studied with socio-economic status as the 
variable. In the twin study of Freeman, 
Holzinger, and Newman, only one set of 
identical twins was reared as far apart 
as the managerial occupational level and 
the slightly skilled occupational level. 
In view of other evidence on the modifiability
of the IQ, one is perhaps not
unjustified in expecting that thorough 
studies on the influence of socio-eco- 
nomic level may reveal profoundly sig- 
nificant evidence of the further modifi- 
ability of the IQ. 

In summary, we have examined the 
significance of current intelligence test- 
ing for the question of nature and nur- 
ture, and have found neither scientific 
grounds nor analogical, empirical 
grounds for a contribution from this 
source. The incautious assumption, on 
the basis of early test results, that the 
relative constancy of the IQ indicated 
the discovery of an index of native capa- 
city has opened the way for the equally 
incautious assumption that the modifi- 
ability of the IQ under special conditions 
indicates that the potential of native ca- 
pacity can be significantly altered by 
marked changes in the environment. 
The facts are not only that intelligence 
as a native capacity is not known to be 
measured either directly or indirectly by 
tests but also that native capacity cannot 
even be said to exist as a distinguishable 
aspect of intelligent behavior. When we 
actually come to understand the genetic 
composition of human abilities, native 
capacity may be a constant factor upon 
which varying amounts of good environ- 
ment can be piled, or a range of varia- 
bility within fixed limits, or a probability 
mode on a scale from which variations 
in either direction are indefinitely possi- 
ble but with increasing difficulty, or 
perhaps a fundamentally indeterminate
factor incapable of independent descrip- 
tion. But the point is that such knowl- 
edge is not likely to come from mental 
testing. 

The typical mental test deals with an 
aspect of behavior, commonly called ab- 
stract intelligence, which is a functional 
concept for the test situation and for 
situations similar to it, like studying the 
conventional school subjects. The IQ is 
a convenient means of ranking a person 
in a standardized group with respect to 
his proficiency in this kind of behavior. 
The net result of recent studies of the
IQ seems to be that a person’s rank on
this kind of behavior, which is only 
somewhat modified by the environmen- 
tal conditions usually encountered in 
growing up in this culture, can be fur- 
ther modified by introducing more radi- 
cal changes in the environment. While 
beyond doubt this is educationally and 
socially significant, it is not the nature- 
nurture question at all. 

We should now have a basis upon 
which to discriminate between the uses
and abuses of intelligence tests and the
IQ. Intelligence tests provide an efficient,
condensed index, relatively uninfluenced 
by such things as school subjects studied, 
of a person’s proficiency in abstract ver- 
bal activities. They provide a basis for 
studying the relation between this pro- 
ficiency and other socially important 
traits and talents. They can be used as 
a means of gaining insight into a per- 
son’s mental processes. They can be used 
to study what other personality traits 
are associated with high abstract verbal 
ability and low scholarship. They can 
be used to gain clues in the study of be- 
havior problems. Under certain condi- 
tions, they provide a basis for ability 
grouping. They can be used in determining
the appropriate job qualifications.

In the realm of prediction, they pro- 
vide a fairly good prognosis of school 
achievement, especially in combination 
with other criteria. Prediction of success 
in other areas is also dependent on the 
persistence of environmental conditions 
and aims similar to those represented 
in the test situation, but such prediction 
can be validated on empirical grounds 
when the probable character of the fu- 
ture environment is known. The use of 
intelligence tests becomes more question- 
able, however, in the study of race and 
sex differences, of psychopathic cases, of 
delinquents and criminals, and of chil- 
dren with speech and hearing handicaps. 
The standardization of test norms quite 
properly does not include, except by 
chance, these atypical groups. Conse- 
quently, direct comparison would be un- 
fair and misleading, but a clinical study 
of the test performance might reveal 
significant clues for remedial work in the 
notable variations from normal re- 
sponses. 

All of these uses of intelligence tests 
may be considered valuable examples of 
educational or psychological engineering 
(1, Ch. III, esp. 54). As such, they are 
properly judged in terms of the criteria 
of engineering or—to use a more com- 
mon word in education—evaluation. The 
place of mental testing in psychological 
engineering, in spite of needed clarifica- 
tion and refinement, is beyond question 
secure. In this study, however, we have 
been examining mental tests solely by 
the criteria of a basic science—the kind of 
science which, as it slowly continues to be 
achieved on many fronts, will provide 
increasingly surer and wider bases for 
psychological engineering. But to this 
task the techniques of mental testing ap- 
pear, so far, wholly unsuited. 





CHAPTER VII 


MENTAL TESTS AND SCIENTIFIC CRITERIA 


IN THE last three chapters mental testing
has been critically examined as a
type of scientific quantification, as a 
means for verifying psychological 
theories, and as a basis for certain scien- 
tific generalizations about human abili- 
ties. In each instance there appeared a 
number of inadequacies or obstacles 
which prevented the current types of 
mental tests from serving as instruments 
of science. The intention in this chapter 
is not primarily to develop new points 
but to bring together in systematic out- 
line the major reasons why mental testing 
fails to qualify as a technique in the 
science of psychology. While this sum- 
mary will carry enough details and illus- 
trations to stand by itself, a much fuller 
discussion of specific points will be 
found in the preceding chapters. 


THE PROBLEM OF “WHAT WORKS” 


Most mental tests are built to satisfy 
some practical standard of validity, such 
as teachers’ marks or teachers’ judgments 
of academic ability. Considerations de- 
volving from some theory concerning the 
psychological nature of an ability are 
usually of decidedly secondary impor- 
tance. Thus, when the aim is to solve 
particular problems of common occur- 
rence, such as classifying the bright and 
the dull in a classroom, it is often sufficient
to obtain test conditions which will
work and continue to work for this de- 
sired end without much regard to which 
of these conditions are critical determi- 
nants for the individual case. For ex- 
ample, in the construction of the usual 
intelligence test, any methods and conditions
are acceptable if they will progressively
distinguish age groups among children
and will tend strongly to enable
children judged bright to score higher 
than children judged average or dull in 
the same age group. The test is presum- 
ably made more valid by standardizing 
the conditions of administration, a proce- 
dure which is intended to equalize moti- 
vation, learning opportunities, response 
opportunities, and so forth. And the test 
is presumably made more useful by 
standardizing the test norms against pop- 
ulations with all varieties of American 
school backgrounds. A test is good to the 
extent it satisfies these aims. In short, a 
good test is one that works. 

Now if we were more interested in a 
science of psychology than in the imme- 
diately practical problem of classifying 
school children according to a certain
value-standard, if we were more interested
in verifying some psychological theory 
or making some precise generalization on 
the nature of human abilities, we would 
probably ask such questions as these: 
To what extent do these test performances
depend on quantitative differences
in similar psychological processes? To 
what extent can these performances be 
achieved by qualitatively different psy- 
chological processes? How much do the 
test and its conditions of administration 
favor or restrict the exercise of qualita- 
tively different psychological processes in 
producing the observed performances? 
To what extent is the emergence of cer- 
tain psychological processes inhibited or 
promoted by various environmental fac- 
tors? To what extent is the emergence 
of certain psychological processes independent
of any particular environment?

The supremely important purpose of 
asking questions like these is to know 
what conditions are critical determinants 
of certain behavior so that investigations 
can then be instituted to find out the 
effect of deliberately changing any or all 
of these conditions. Since the aim of 
science is general rather than particular, 
precise knowledge of the effect of speci- 
fied conditions is sought so that control 
and prediction of the effects of various 
combinations of these conditions is pos- 
sible. Such knowledge determines the 
range of possible practical purposes— 
e.g., the limits of what one might do with 
certain types of schooling for certain chil- 
dren—and is pertinent in some predic- 
tive way to any of the practical purposes 
within this range. In short, the results 
of science become parts of a rational 
system which not only explains what 
factors are responsible for observed re- 
lationships, but logically predicts new 
relationships under similar conditions 
and under logically altered conditions. 

All of this may be summarized in the 
statement that both science and experi- 
mental evaluation are concerned with 
“what works” but that in science a more
refined emphasis is given to what works.
The several critical points at which mental
testers do not know “what is working”
in their procedures have been mentioned
earlier and will be summarized in
the following pages. Of course, for most
of the purposes of testing a precise
knowledge of what is working is not 
necessary at all, but this state of affairs 
is undoubtedly a major reason why mental
testing has not verified any psychological
theory nor established any scientific
generalizations concerning the nature
of human abilities.


AN OPERATIONAL DEFINITION OF
“ABILITY”


A second reason why mental test re- 
sults have not verified a theory of human 
abilities probably lies in the fact that
current ability theories are actually not
appropriate to the operations of mental
testing. All these theories are based on 
the assumption that the test perform- 
ance is one thing and that the ability 
which produced it is another thing. The 
structure of each theory deals with the 
presumed organic nature of various 
human abilities. The testing research, 
however, deals only with the perform- 
ance. Of course the ability and the per- 
formance are granted to be related in 
some way, but very few would claim that 
the relationship is perfectly represented 
in any given performance. The problem 
thus becomes, as far as mental testing is 
concerned, how to manipulate the per- 
formances so that they truly represent 
the inferred ability. There is no veri- 
fiable answer to this problem by study- 
ing performances alone. Various logical 
assumptions have been tried, such as as- 
suming that the expressions of an ability 
over a period of time should be normally 
distributed, but they remain merely as- 
sumptions incapable of scientific verifica- 
tion. 

If ability and performance are actually 
distinguishable entities in scientific lan- 
guage, independent approaches to each 
must be possible. One possible approach 
to an ability thus defined might be 
through physiology or neurology. Or 
perhaps something will come of the very
recent technique of electroencephalog- 
raphy (35, Ch. VI). Another type of 
approach might be through some form 
or modification of experimental psycho- 
analysis. Along this line, the introduc- 
tion of the biological terms of pheno- 
type and genotype into psychological 
literature may provide the basis for a
scientific distinction between ability and
performance. A phenotype is some
culturally-defined activity or experience, 
structurally independent of the pecu- 
liarities of participating individuals and 
expressed in the ordinary “language of 
data.” Hence, it is a phenomenological 
description of behavior, like “being 
honest” or “acting intelligently.” A genotype
is the inclusive, dynamic, causal pattern
underlying various examples of behavior,
usually expressed as a hypothetical
construct with no corresponding
physical reference.

As an illustration of the significance 
of this distinction, two persons can ex- 
hibit the same phenotype and yet act 
from quite different causes. For example, 
they may both appear very aggressive, 
one because he is trying to cover up em- 
barrassment and the other because he 
habitually enjoys dominating the group. 
On the other hand, two persons may 
exhibit quite different phenotypes and 
yet act from practically the same causal 
pattern. For example, one person may 
appear aggressive and the other appear 
very shy and reserved, both because of 
acute embarrassment. The concept of 
genotype as used in current research is 
not, however, something exclusively in 
the individual. It is a construct which 
takes the person-in-situation as the unit 
and attempts to explain the whole pat- 
tern in terms of a few critical relation- 
ships. 

But these possible approaches have 
little relation to mental testing. The 
mental testers may well question whether 
ability and performance are actually dif- 
ferent entities in any operational sense. 
It could be argued that the two terms 
are distinguishable logically but not 
existentially. Existentially they may be 
merely a hyphenated expression of a
single relationship—an ability-performance.
Whatever the argument, the fact
remains that the only operational definition
of ability possible in the procedures
of mental testing is in terms of a
critical relationship between a purposing 
individual and a problematic situation 
(i.e., the test). The methodology of men- 
tal testing provides no way of opera- 
tionally defining an ability and a per- 
formance as distinct but related entities. 

If the above definition of an ability is
accepted as the only one appropriate to 
the operations of mental testing, several 
things are implied. First, this definition 
implies just as much “testing’’ of the 
situation as of the individual. Second, it 
implies that those theories dealing with 
the “internal” nature of human abilities 
are clearly not appropriate to the defini- 
tions and procedures of mental testing. 
Whether other theories based on this
behavioristic definition of ability are now 
needed (e.g., a cogent theory to back up 
Thurstone’s experiments in factor analysis)
is an open question, depending for
its answer somewhat on the view one 
takes of the following analysis of the 
scientific prospects of mental testing. 

Even if it is agreed that the conception
of ability appropriate to mental testing 
is a functional relationship between a 
purposing individual and the proble- 
matic test situation, the operational defi- 
nition of that relationship will be ex- 
pressed in significant part by the method 
of scoring the test results. For example, 
it will have to be assumed that quanti- 
tative variations in the ability-relation- 
ship are some function of the variations 
in test scores, whether this function be 
linear, logarithmic, or some other mathe- 
matical form. Present methods of scoring 
test performances are still very arbitrary. 
Outside of certain corrections for chance 
errors, they are almost entirely selected 
to suit the convenience of the investigator
rather than to represent logical deductions
from explicit assumptions. In
short, cogent grounds have yet to be 
proposed and verified for determining 
what particular relationship, if any, the 
test ability actually bears to the test 
scores. 
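
The arbitrariness of the scoring function may be illustrated with a minimal sketch. The three raw scores and the two candidate functions (one linear, one logarithmic) are hypothetical and are chosen only to show that the same scores yield different pictures of the ability-relationship depending on which function is assumed.

```python
import math

# Hypothetical raw scores for three testees; not data from any study.
raw_scores = {"A": 10, "B": 20, "C": 40}


def ranking(scale):
    """Order the testees from lowest to highest under a scoring function."""
    return sorted(raw_scores, key=lambda name: scale(raw_scores[name]))


def gaps(scale):
    """Differences between adjacent testees under a scoring function."""
    values = [scale(raw_scores[name]) for name in ranking(scale)]
    return [upper - lower for lower, upper in zip(values, values[1:])]


linear_gaps = gaps(float)     # ability assumed proportional to the score
log_gaps = gaps(math.log2)    # ability assumed logarithmic in the score

print("rank order, linear:", ranking(float))
print("rank order, log:   ", ranking(math.log2))
print("gaps, linear:", linear_gaps)
print("gaps, log:   ", [round(g, 3) for g in log_gaps])
```

Under the linear reading, C exceeds B by twice as much as B exceeds A; under the logarithmic reading the two steps are equal. The rank order is identical in both cases, so the scores by themselves offer no ground for preferring one reading to the other.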


THE HOMOGENEITY OF ABILITY 
BETWEEN INDIVIDUALS 


If test results were to provide scien- 
tific generalizations or verification of a 
psychological theory of human abilities, 
the adoption of a strictly operational 
conception of a test ability would be but 
a preliminary step. The next require- 
ment would be that the ability-relation- 
ship thus defined should have unam- 
biguous psychological meaning with re- 
spect to all persons taking the test. 

This requirement was discussed in an 
earlier chapter with respect to test meas- 
urement, but it is just as essential in 
tests used as sampling devices, where 
scores are the product of simple count- 
ing. In this case mental tests are intended 
to report a sample of an ability-relation- 
ship which to some extent can be dis- 
played by all members of the group. The 
question is whether the ability displayed 
by one person on the test is actually com- 
parable to the ability displayed by an- 
other. An analogy will illustrate that 
point. Suppose that the members of the 
group are box cars loaded with produce 
and that the test is a bushel basket for 
scooping out samples of produce from 
each box car. If each box car is loaded 
with wheat of various size and quality 
and if the bushel basket scoops up a 
representative sample from each car, 
then the contents of the box cars can be 
compared in terms of the variations ob- 
served in the samples. But if it is not 
known what kind of produce the box
cars contain—if, for instance, some contain
wheat while others contain potatoes
—it would be ridiculous to compare the 
size and quality of a bushel basket of 
wheat with the size and quality of a 
bushel basket of potatoes. Obviously, a 
meaningful comparison of samples de- 
pends upon independent knowledge that 
the box cars contain the same kind of 
produce. 

In mental testing what assurance do 
we have that passing 100 items in a given 
test represents comparable samples of 
ability between members of the group? 
Any assurance would have to rest upon 
independent knowledge of such factors 
as similar difficulty overcome, similar 
purposes or motivation, and similar psy- 
chological processes employed. In short, 
the ability being sampled must be homo- 
geneous among all members of the 
group, varying only in degree, quality, 
or amount. Otherwise test scores would 
be scientifically incomparable. There- 
fore, let us briefly review what is in- 
volved in establishing the psychological 
homogeneity of test-ability. 

In current testing the meaning of test 
scores is based primarily on some group 
definition of difficulty. When difficulty 
is defined as the ratio of successes and 
failures on a series of items by the group 
as a whole, any individual’s score can 
indicate only how well that person does 
in terms of a value standard arbitrarily 
defined by the group. Some other stand- 
ard, such as a critical score for admission 
to some job or some school, could be 
used just as logically. The choice of any 
of these standards of comparison depends 
primarily on the practical purposes of 
the test. When the definition of difficulty 
is based on the distribution of scores 
(preferably a normal distribution), the 
linear continuum on which the distribu- 
tion is expressed is not actually known 
to be anything more than a range of
“goodness” of some accepted kind, like 
intellectual goodness or arithmetic good- 
ness. The range of goodness is homo- 
geneous, so far as we know, only from 
the standpoint of social value—e.g., col- 
leges prefer the top ranking applicants 
in intellectual goodness for admission, 
or engineers prefer assistants who stand 
out from the crowd in degree of mathe- 
matical goodness. 
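
The dependence of difficulty on the group may be shown with a small sketch. Difficulty is taken here, as one common convention and not necessarily the one used in any particular test, to be the proportion of a group failing an item. Both groups and their responses are invented for illustration.

```python
def proportion_passing(responses):
    """Share of a group answering an item correctly (1 = pass, 0 = fail)."""
    if not responses:
        raise ValueError("the group must contain at least one response")
    return sum(responses) / len(responses)


def group_difficulty(responses):
    """Difficulty of an item as the proportion of the group failing it."""
    return 1.0 - proportion_passing(responses)


# Hypothetical responses of two different groups to the same item.
group_one = [1, 1, 1, 1, 0, 1, 1, 1, 0, 1]
group_two = [0, 1, 0, 0, 1, 0, 0, 1, 0, 0]

print(f"difficulty in group one: {group_difficulty(group_one):.1f}")
print(f"difficulty in group two: {group_difficulty(group_two):.1f}")
```

The item itself is unchanged, yet it counts as easy in the first group and hard in the second. A difficulty figure of this kind describes the standing of the item in a particular group and reveals nothing about what the item demanded of any one person.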

No one would deny the usefulness of 
these meanings of test scores, but a science
of psychology is concerned with
other meanings. Do two identical scores 
mean that the same kind and amount of 
psychological processes were employed? 
Do they mean similar sociological back- 
grounds of experience? Do they mean a 
qualitatively similar adaptation to the 
immediate test environment? Do they 
mean that comparable amounts of 
psychic tension were built up or that 
similar amounts of nervous energy were 
expended? The consequences of having 
such knowledge are manifold. If we 
could attribute various performances to 
the influence of several unambiguously 
identified and quantified factors, we 
could then undertake scientific studies 
of heredity and the effect of various en- 
vironments on the combined expression 
of these factors. By knowing precisely 
what factors accounted for differences in 
performance between persons, we could 
very likely predict such things as what 
effect a specified change in environment 
would have on those ranking low in 
performance goodness and also what ef- 
fect certain kinds of environment would 
have on the later performances of those 
now scoring high in performance good- 
ness. 

The achievement of such scientific 
meanings as these from the current 
methodology of mental testing is probably
too much to expect, for test results
at present are notoriously ambiguous in 
what they signify about the socio-psycho- 
logical ingredients of the recorded per- 
formances. If the psychological processes 
in various performances are to be made 
unambiguous, individuals would have 
to be studied rather than total groups, 
and this means that the experimental 
operations in testing would have to be 
valid for individuals. One of the major 
conditions required for valid individual 
testing of this type would almost unes- 
capably be the achievement of an opera- 
tional concept of individual difficulty in 
place of the current group concepts. 

A second condition to the establish- 
ment of the homogeneity of the ability 
tested is to insure that repetitions of the 
same test are dealing with the same 
ability-relationship. For this, it is neces- 
sary to know either that the repetition 
of the test does not involve any quali- 
tative change in the ability—i.e., the 
ability may be different only in quantity 
on a repetition of the test—or if intelli- 
gent adaptation has entered in and 
qualitatively changed the relationship, 
the extent of the qualitative changes 
should be scientifically ascertainable and 
corrected for in the results of the test 
repetitions. As yet neither of these alter- 
natives is subject to scientific control. 
When an attempt is made to equate 
the results of two different mental tests, 
this deficiency becomes even more acute. 
In order to make two similar tests scien- 
tifically comparable, it is necessary at the 
very least to know that the two tests are
psychologically equivalent at all levels
and that they are equal in difficulty at
any one level. By these criteria, not even 
such carefully constructed tests as the 
Stanford Revision and the Kuhlmann- 
Binet are interchangeable “without dis- 
tortion of the facts” (35, 378). 

A third condition is the elimination of
the effect of variations in purpose or 
motivation. Very recently Lund (35, Ch. 
10) has summarized the considerable re- 
search on the effect of emotions on the 
direction of mental activity and on the 
quality and level of test performances. 
Yet all mental tests rest on the assump- 
tion that both the kind and the degree 
of motivation between individual performances
are held constant. The only
justification offered for this assumption 
is that the test environment and test 
instructions are “standardized.” While 
this standardization is certainly better 
than nothing for many practical pur- 
poses, the most careful experimenters 
recognize its inadequacy for science. 
Thurstone, for example, would like to 
eliminate the element of motivation en- 
tirely. “Other fields should be explored 
in the hope of eventually obtaining 
methods of appraising individuals more 
or less independently of their efforts to 
perform well on particular testing oc- 
casions” (56, 233). Whether this can be 
done without serious distortion of the 
psychological nature of ability remains 
an open question. 

A fourth condition of establishing the 
homogeneity of the ability tested would 
be to verify the assumption that this 
ability-relationship is qualitatively the 
same for all persons, and that its varia- 
tions between persons is consequently 
only quantitative. At present we know 
only that a test classifies persons accord- 
ing to some standard, not what it classi- 
fies or differentiates between. The test 
classification can be given added prac- 
tical significance by establishing correla- 
tions with other kinds of performances, 
but such correlations provide no conclusive
evidence on precisely what is
being correlated. The emphasis is on 
getting the test results to work—e.g., to 
make useful predictions for similar conditions—rather
than on what is working.
We still do not know whether fairly re- 
liable correlation coefficients are due to 
the constancy of factors in the psycho- 
logical processes of the testees, in their 
sociological backgrounds, in the con- 
struction of the test itself, or in some 
combination of all three. Consequently, 
we also do not know whether the effec- 
tive factors—be they psychological, socio- 
logical, or characteristic of the test—are 
common to all the testees in producing 
the observed performances or whether 
qualitatively different factors operate in 
various testees to produce the observed 
performances. Our ignorance in this re- 
spect means ignorance in regard to how 
various factors could be systematically 
changed for other desired results. 

Now let us suppose that some investi- 
gator has assumed that at least some 
of the factors producing a set of per- 
formances are essentially psychological 
and that he wishes to know whether 
these psychological factors are common 
in some amount to all members of the 
group. Very probably the following four 
propositions would have to be verified: 
(1) that the test presents the same problem
to all the testees—i.e., each testee
sees the same kind of task to be accom- 
plished; (2) that obtaining the right an- 
swers involves the same kind of psycho- 
logical process from each testee—e.g., the 
genotypical basis of the performance is 
the same or made the same for each 
testee; (3) that if a testee fails to give 
the right answer, the reason is that he 
is actually unable to attain the particular 
psychological process required; (4) that 
either the same quality of psychological 
relationship persists from the accom- 
plishment of easy items to the accom- 
plishment of hard items for each person, 
or any qualitative changes are known 
and accounted for. Any one acquainted
with the problem of test construction 
and administration will recognize how 
exceedingly difficult, if not impossible, 
these requirements would be to fulfill. 
These requirements are not new, however;
they have been detected and described
by other writers since the early
days of the testing movement (19, 78; 
43, 31; 62, 162).* But they have seldom
been seriously faced, chiefly because most 
tests are used for practical problems of 
experimental evaluation where the iden- 
tification and control of the components 
of the performances is usually unneces- 
sary. When they have been seriously 
faced, the chief effort has been directed 
toward circumventing them in some 
plausible but unverifiable way. Prac- 
tically no attempts have been made 
actually to fulfill these requirements, 
probably because psychologists have 
lacked adequate tools (both physical and 
logical) of control. 


THE LOGICAL USE OF NUMBERS 


Another point at which mental testing 
results have failed to meet the criteria of 
a science is that certain practical con- 
siderations have led virtually all testers 
to depart from careful adherence to the 
logic of numbers. One of the major 
themes of Chapter III was that the em- 
pirical data of science must not be dis- 
torted to fit a preferred logical form but 
that a logical form should be found 
which is appropriate to the “contours”
of the data. Only in this way can the
verification of a final deduction also indirectly
verify the preceding chain of
assumptions and hypotheses and thus
build a sound scientific theory. Accordingly,
careful attention to how well data
fit the postulated conditions of transitivity,
asymmetry, ordinal numbers, and
cardinal numbers is essential in order to
sustain the logic of scientific inquiry and
to give existential meaning to the products
of mathematical manipulation.

* A theory of human behavior which appears to
solve the difficulties noted above but which does
not yet provide a scientific role for mental tests
has been recently advanced by Snygg (45).

But in much mental testing, data that 
logically fit one kind of number are often 
treated as though they fit another kind 
of number. For example, a person’s men- 
tal age as derived from an intelligence 
test norm is generally admitted to be an 
ordinal number. A person’s chronologi- 
cal age is commonly considered to be a 
cardinal number on the basis that one 
year is approximately as long as any 
other, although the fact that chronologi- 
cal ages in this connection are viewed as 
years of increase in maturation instead 
of lengths of time gives strong support 
to the contention that chronological ages 
are also ordinal numbers in intelligence 
testing (63, 25). But in spite of the pres- 
ence of at least one ordinal number, the 
two are divided and multiplied by 100 
to obtain an IQ. This of course violates 
the formal properties of numbers, be- 
cause only cardinal numbers are logically 
capable of division and multiplication. 
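
The arithmetic at issue can be made concrete. The following is only an
illustrative sketch: the two testees and their ages are invented, and the
helper names `iq` and `relabel` are assumptions introduced here, not taken
from any test manual. It computes the conventional quotient and then applies
an order-preserving relabelling to the mental ages. If mental age is no more
than ordinal, such a relabelling loses nothing the scale actually carries,
yet the quotients change and their order is reversed.

```python
import math


def iq(mental_age, chronological_age):
    """Conventional ratio IQ: mental age over chronological age, times 100."""
    return 100.0 * mental_age / chronological_age


# Hypothetical testees: (mental age, chronological age), both in years.
testees = {"A": (10.0, 8.0), "B": (14.0, 11.0)}


def relabel(mental_age):
    """An arbitrary order-preserving (monotone increasing) relabelling.

    It keeps every "higher than / lower than" relation among mental ages,
    which is all that an ordinal number asserts.
    """
    return 3.0 * math.sqrt(mental_age)


original = {name: iq(ma, ca) for name, (ma, ca) in testees.items()}
relabelled = {name: iq(relabel(ma), ca) for name, (ma, ca) in testees.items()}

for name in testees:
    print(name, round(original[name], 1), round(relabelled[name], 1))
```

With these invented figures B stands above A on the original quotients and
below A after relabelling, although the order of the mental ages themselves
is untouched. The quotient therefore depends on more than ordinal properties.
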
Another example appears in the derivation
of correlation coefficients. The
raw scores on an intelligence test are the 
sums of counting, in which cardinal 
numbers are used because the test items 
are distinguishable but similar members 
of a group. A common treatment of these 
raw scores is to plot their frequency of 
occurrence on a base line running from 
the lowest to the highest score. So far as 
the investigator knows, the base line is 
not a continuum of increasing intelli- 
gence but merely a collection of discrete 
totals arranged in an ascending order. 
Moreover, the intervals between the 
scores are made equal arbitrarily for the 
mathematical convenience of plotting;
they are not equal in “muchness” of any 
quality. But the scores of a group on 
two different tests are usually laid out 
in this fashion for the purpose of cor- 
relation. 

The correlation technique most fre- 
quently used is Pearson’s product- 
moment method, but this technique is 
based on two logical premises: that the 
base line is a continuum and that the 
data on the continuum are described in 
equal intervals of increasing or decreas- 
ing amount. A further requirement, in 
order to give stable meaning to the cor- 
relation coefficients, has been that the 
two distributions of scores should be
normal, because “variable degrees of 
skewness in distributions to be correlated 
affect the size of the coefficient of cor- 
relation to a degree not precisely known” 
(36, 734). This last requirement, however,
may be avoided in the future, according
to recent statistical research (18,
29-43; 13, 675-701). But the first two
logical premises are not fulfilled by the
test data except by an unverified and 
probably unverifiable assumption. The 
assumption is still made in the treatment 
of test results and this correlation tech- 
nique is then used, but the logic of 
scientific inquiry is broken and the con- 
clusions reached are scientifically root- 
less. 
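
How far the product-moment coefficient rests on the spacing of scores, and
not merely on their order, can be shown numerically. The sketch below is
illustrative only: the scores are invented, and `pearson_r` is a hand-written
helper assumed here for the purpose. One set of scores is respaced so that
every testee keeps the same rank, and the coefficient is computed before and
after.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical raw scores of five testees on two tests.
test_1 = [1, 2, 3, 4, 5]
test_2 = [1, 3, 2, 5, 4]

# Respace the first test's scores without disturbing anyone's rank.
respaced_1 = [x ** 3 for x in test_1]

r_raw = pearson_r(test_1, test_2)
r_respaced = pearson_r(respaced_1, test_2)

print(round(r_raw, 3), round(r_respaced, 3))
```

With these figures the coefficient falls from 0.8 to roughly 0.69, although
no testee has changed rank on either test. Unless the intervals between
scores are known to be equal, there is no ground for preferring one spacing,
and hence one coefficient, to the other.
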

Another technique less often used is 
the rank correlation method, but its use 
also assumes not only a continuum but 
equal intervals between each rank when 
plotted on the continuum (23, 191). 
Occasionally the attempt is made to cir- 
cumvent the logical requirement of 
equal intervals by converting the distri- 
bution of raw scores into “equal” sigma 
units, but an earlier chapter pointed out 
the purely hypothetical nature of the 
equality achieved in this way. Hence,
from the standpoint of the rigorous logic
of scientific inquiry, mental test results 
are not submissible to correlational tech- 
niques. 

The great majority of mental testers 
have not been bothered by this violation 
in logical procedure. Their chief con- 
cern has been with developing a testing 
procedure which would work better than 
individual judgments in certain prac- 
tical situations, and as long as the test 
worked without seriously contradicting 
other sources of information, knowledge 
of precisely what was working could be 
foregone or postponed. Moreover, the 
application of correlational techniques 
to their data, even though strictly illogical,
often was very helpful in suggesting
new hypotheses to explore in
further experiments. Of course, these 
hypotheses were not logically deduced, 
as from a scientific theory, but they were 
nevertheless suggested to the imaginative 
minds of the experimenters and occasionally
led to important and highly
useful discoveries. The limitation on this 
treatment of test results—i.e., that we do 
not know with any great refinement or 
rigorous deduction just what is working—
need only qualify rather than deny
further investigations. If this limitation 
is carefully borne in mind when test re- 
sults are correlated, the fruitfulness of 
tests for the practical problems of psychological
engineering can no doubt still
be increased. 


AN ATOMIC VS. AN ORGANIC SYSTEM 


The basic assumption of all mental 
test research which has presumed to 
make contributions to a science of psy- 
chology has been that the ability-rela- 
tionship, or its components in the case 
of factor analysis, operates in an atomic 
system rather than in an organic system. 
In effect, this means that an ability may 
be relatively isolated and statistically
treated independently of other abilities. 
This assumption has been worth a trial 
because atomic systems are much easier 
to control experimentally and to manip- 
ulate mathematically. The past twenty 
years have seen elaborate and intensive 
research conducted on this assumption. 
But the prospect at present seems to be 
that this assumption has very nearly ex- 
hausted its promise and that there is 
increasing justification for assuming that 
human abilities operate in an organic 
system. 

If mental testing were to be turned to 
the exploration of the organic assump- 
tion, the consequent radical revision 
which would be required in testing pro- 
cedures probably goes far beyond what 
anyone has anticipated as yet. The technical
difficulties of the change, however,
are not the only deterrent. Probably a 
change in the philosophy of the mental 
testers would also have to occur, and 
this point calls for some explanation. 

In the early days of mental testing, the 
opponents to the movement were the 
philosophical idealists, who were indignant
at the proposal to quantify anything
so intangible and transcendent as
the mind. They were answered in a 
classic statement in 1922 by Edward L. 
Thorndike (33, 1), the leading exponent 
of testing at that time. He argued that 
anything that exists, exists in some 
amount, and anything that exists in some 
amount can be measured. To this his 
disciples added the corollary that any- 
thing claimed to be intangible was either 
poorly defined or did not exist. This is 
an excellent statement of the position of 
philosophical realism, which holds that 
the universe consists of a host of “reals,”
things-in-themselves, which exist quite 
independently of and prior to any knowledge
we may have about them. They
are known as artichokes are eaten, by
peeling off the leaves of meaning. Men- 
tal testers adopted, consciously or un- 
consciously, this view of the traits which 
common sense assigned to people, and 
set out to gauge their dimensions. 

This bias is still largely characteristic 
of mental testing today. While there is 
increasing acceptance of the proposition 
that mind is a functional aspect of be- 
havior, some testers still speak of it as an 
entity with a structure and dimensions 
all its own. For example, Thurstone ob- 
served in 1940: “It seems difficult to con- 
ceive just how individual differences in 
particular aptitudes can exist at all un- 
less mind has some kind of structure 
that can be described by a system of in- 
dependent parameters” (56, 217). His 
postulates in factor analysis are of the 
same character. A defense of this view- 
point with explicit reference to the as- 
sumptions of philosophical realism has 
recently been advanced by Breed (2, 
118-29). 

Modern opposition to this view is now 
offered, not by the idealists, but by the 
pragmatists or experimentalists, who 
lead the Progressive Education move- 
ment. Drawing upon the psychological 
predilections of John Dewey and the 
findings of Gestalt psychologists, they 
have propounded an organic or organis- 
mic point of view at sharp variance to 
the position of the realists. Their argu- 
ment, so far as it has implications for 
testing practices, may be described as 
follows. 

Organisms adapt themselves to the en- 
vironment. Different organisms adapt 
themselves to the same stimulus by re- 
sponses that are both quantitatively and 
qualitatively different. The more ex- 
perienced and intelligent the organisms 
are, the more possibilities there are for 
qualitatively different behavior. Such 
complexity of reaction is seldom, if ever,
found in the physical sciences. In the 
case of temperature, for example, heat 
“adapts” itself to a thermometer, so far 
as we can determine by experimentation, 
essentially only in a quantitative way. 
Such qualitative differences that have 
been found to exist, such as humidity 
and altitude above sea level, are either 
excluded from the testing or corrected 
for in the results according to verified 
laws which express their effect. Even the 
distortion caused by the introduction of 
the thermometer into the system to be 
measured is experimentally controlled. 
Under such conditions, the reporting of 
only quantitative variations is practically 
assured. 

But organisms, in this case human beings,
can adapt themselves to stimuli like
a test in many qualitatively different
as well as quantitatively different ways. 
These differences in ways of responding 
not only may apply between persons or 
between groups but also are just as likely 
to apply between repeated encounters of 
the same person or same group with the 
test. The reason is that persons are able 
to profit from past experience and meet 
this situation with intelligent insight. 
Indeed, intelligence might be defined as 
the capacity for qualitative variation of 
response to a given stimulus. 

In this conception of human behavior, 
an organic system of ability-relationships 
is obviously implied. From this viewpoint,
any attempt to establish scientific
conclusions from test results by treating 
an ability as an isolable variable, inde- 
pendent of other abilities, would be orig- 
inating from a basic error. In other 
words, “Research that tries to discover 
the consequences of a single change, 
‘other things being equal,’ is making a 
false assumption at the start. Change one 
thing, and all other factors will probably
not remain equal. Some will be
altered more and others less, but an up- 
set at one point will change the distri- 
bution of forces all over the psychologi- 
cal field” (1, 119). 

But the viewpoint of pragmatism is 
not only critical of the current postulates 
of mental tests as presumably scientific 
instruments but is also critical of their 
similar assumptions as instruments of 
psychological engineering or evaluation. 
Accordingly, the Progressives in educa- 
tion have been strong supporters of the 
recent trend toward evaluation proce- 
dures which both deemphasize and go 
far beyond so-called measurement by 
standardized tests (17, 299). This by no 
means indicates that psychological tests 
are to be rejected as evaluation instru- 
ments by the pragmatists. It does mean, 
however, that past methods of scoring 
test results, establishing validity, and de- 
termining test aims will probably be 
significantly altered in order to fit the 
postulates of the pragmatic view. In the 
long run, perhaps, pragmatists may make 
even wider use of testing procedures in 
education and psychological engineering 
than has been the case so far. 

To a few who are consciously philo- 
sophical realists this pragmatic trend is 
a calamity and threatens to sound the 
death knell of a science of education (2, 
119). Most test builders, who are not ac- 
customed to associating philosophical is- 
sues with their work, are probably a little 
confused by this emerging conflict. Many 
of them are undoubtedly pragmatic in 
most matters and consider themselves 
disciples of Progressive Education, but 
have not yet had occasion to examine the
basic assumptions of their present pro- 
cedures in testing. They have probably 
inherited the assumptions of the realist 
from the early days of testing, and to 
some of them it may still be the “common
sense” view. But they should look
now to the philosophical foundations of 
their construction and use of tests, for 
they may be accepting uncritically a spec- 
ulative philosophical viewpoint which 
they would not choose to defend. 
What of the future of mental testing? 
The assumption that certain mental 
tests, built and administered on the 
atomic premises of philosophical realism, 
are characteristically scientific instru- 
ments in identifying and measuring 
fundamental human capacities and pow- 
ers has been found wholly unverified in 
the preceding chapters of this study. The 
prospect that mental tests can eventually 
become scientific instruments on the 
premise of the atomic organization of 
human abilities is exceedingly doubtful, 
chiefly because the premise involves a 
large number of corollative assumptions 
which virtually defy present means of 
experimental verification. That a
shift to organic premises will create a
role for tests and testing procedures appropriate
to a science of human abilities
also appears rather unlikely at this stage.
From the pragmatic viewpoint, at 
least, this conclusion enhances rather 
than detracts from the genuine useful- 
ness of tests. In the past, the practice 
of attributing scientific status to mental 
testing has been responsible not only for 
many of the abuses of testing and test 
results, but also for many of the theoreti- 
cal contentions that educational aims 
and procedures could and should be 
“scientifically” isolated, analyzed, and 
blue-printed (2, 118-23; 38, 81). Tests are 
unquestionably valuable aids to the edu- 
cator and the psychologist, but now the 
range and degree of their usefulness can 
be said to depend on how clearly and 
accurately they are seen as engineering 
instruments for purposes of evaluation. 
This monograph has attempted a
fundamental study of tests as appropriate
instruments for a science of psychology, 
and now a fundamental study is needed 
of tests as appropriate instruments for 
psychological engineering. Such a study 
should not only systematically explore 
the many present and potential uses of 
tests in evaluation, but also contrast the 
differing consequences of realism and 
pragmatism as theories of evaluation. 
Only then will the research student and 
the practicing teacher have a comprehen- 
sive basis upon which to use psychologi- 
cal tests with confidence, efficiency, and 
discrimination. 


CONCLUDING SUMMARY 


The principal aim of this study has 
been to make a systematic analysis and 
criticism of the extent to which recent 
mental testing is a form of scientific 
measurement or contributes to a basic 
psychological science. One chapter de- 
fined science, considered as a body of 
knowledge, as consisting of those propo- 
sitions and laws, directly or indirectly 
verifiable, which not only describe the 
constant relations between particular 
events but which are so generalized or 
fundamental that they are independent 
of any one set of events and apply to all 
cases of this kind of events. In order to 
possess cogent grounds for the compre- 
hensive explanation of known facts and 
the prediction of likely but unverified 
propositions, the knowledge of a science 
is commonly organized in one or more 
scientific theories. Another chapter out- 
lined the formal requirements of quan- 
tification, with special reference to scien- 
tific measurement and ranking. 

In the analysis of mental testing as a 
form of scientific measurement, the key 
dimension for quantification was found 
to be “difficulty overcome.” The follow- 
ing principal points were developed to 
demonstrate that scientific quantification
is not being achieved with this dimen- 
sion by mental tests: (1) Only group 
conceptions of difficulty are used in cur- 
rent test construction, and these group 
conceptions are not equivalent to “difficulty
for the individual” either on operational
grounds or on experimentally
verified grounds. (2) The concept of difficulty
is not experimentally established
as a homogeneous continuum. (3) Equal 
units have not been achieved, even with- 
in some known margin of experimental 
error, for measuring amounts of human 
abilities. The use of the normal curve 
of distribution cannot provide equal 
units independently of and previous to 
its plotting on a graph. 

Thus, the major conclusion was that 
mental testing fails to provide scientific 
measurement of human abilities, not be- 
cause its units are merely approximate, 
but because its results are so far incap- 
able of fulfilling the essential logical con- 
ditions of scientific quantification. How- 
ever, current concepts of difficulty may 
well serve as convenient standards for 
evaluating the worth of certain perform- 
ances. Against such standards, test per- 
formances can be meaningfully enum- 
erated and ranked (though not meas- 
ured). 

In regard to other possible contribu- 
tions of mental testing to a science of 
psychology, the major conclusion was 
that mental testing fails to verify any of 
the proposed scientific theories about 
human abilities, not only because test re- 
sults are so far inconclusive, but also 
because current procedures in test con- 
struction and interpretation are not sub- 
missible to the rigorous logic required in 
a scientific theory. Many of the most 
critical assumptions made about the nature
of human abilities in the construction
and interpretation of mental tests
are not verified and are usually not verifiable
in present testing procedures.
Moreover, it appears quite likely that the 
current atomic theories of human abili- 
ties, associated with mental testing, are 
not adequate to account for all of the 
significant facts, and that an organic 
theory is needed before a science of 
human abilities will be realized. 

The principal methodological reasons 
why mental tests have not made direct 
contributions to a science of psychology 
include the following: (1) The widespread 
tendency to regard “ability” as an or- 
ganic, psychological entity distinct from 
the performance it produces, thus re- 
moving it from the possibility of scien- 
tific control and description through 
mental tests alone. (2) The lack of an 
operational definition of performance- 
difficulty suitable for use with individ- 
uals instead of with groups only, as is 
now the case. (3) The fact that test pro- 
cedures do not reveal discriminatingly 
and decisively “what is working” to pro- 
duce the observed performances. (4) The 
lack of experimental controls which 
would guarantee that the “ability” being 
tested is psychologically homogeneous 
between persons and between repeated 
testings of the same person. For example, 
do two identical scores mean that the 
same kind and amount of psychological 
processes were employed? Does the test 
present the same problem to all the 
testees? (5) The lack of cogent, logical 
grounds for determining the appropriate 
scoring method to describe scientifically 
the test performance. (6) The widespread 
use of statistical forms when the raw 
data do not fulfill the formal require- 
ments for such logical manipulation. 

In sum, mental tests do not achieve a 
quantification of abilities sufficiently 
rigorous and unequivocal for scientific 
laws or generalizations, and in the
non-metrical area they do not verify, either
directly or indirectly, the chief assump- 
tions made about the nature of human 
abilities in the construction and inter- 
pretation of these tests. As a result, men- 
tal tests have made no positive contri- 
butions to a scientific theory of human 
abilities, and do not appear very likely 
to do so. In fact, no cogent scientific 
theory regarding human abilities has yet 
been devised which includes critical 
hypotheses capable of verification or 
falsification through mental testing pro- 
cedures. These conclusions have direct 
implications for such controversies 
among psychologists as the perennial 
nature-nurture debate, for these conclu- 
sions indicate that it is not possible to 
say (1) that mental tests provide scientific
indices of innate intellectual ability
or any means of distinguishing between
the effects of innate capacity and environmental
influences on a person’s ability;
(2) that mental test results provide
adequate scientific grounds for assigning 
any certain percentage of the factors pro- 
ducing a person’s performance or his 
rank in a group to the influence of native 
endowment; (3) that mental test results 
provide adequate scientific grounds for 
asserting that the influence of a person’s 
native endowment remains relatively
constant up to maturity or throughout
life; or (4) that mental test results pro- 
vide adequate scientific grounds for as- 
serting that radical changes in the en- 
vironment can significantly and perma- 
nently alter the influence of a person's 
native endowment upon his subsequent 
behavior and his intellectual rank in a 
group. 

Probably the chief scientific contribu- 
tion of mental testing is that the move- 
ment has led to the abandonment, at 
least in this country, of the introspective 
approach to psychology and the substi- 
tution of a behavioristic approach. But 
though mental tests appear unsuited to 
the discovery of those generalized facts, 
principles, and laws characteristic of a 
basic science of psychology, they are very 
useful diagnostic instruments for imme- 
diate, practical purposes, like grouping 
pupils or selecting the best applicant for 
a job. This is the realm of psychological 
engineering or evaluation, where particular
and changing purposes call for
corresponding types of tests. Because of 
the great contributions which have al- 
ready come and which may still be ex- 
pected from mental tests in this area, 
there is urgent need for a thorough re- 
examination and systematic definition of 
mental tests as instruments of evaluation. 